1. Executive Summary

The DIVA (Data Intensive Architecture) Project has developed a prototype workstation-class
system using VLSI PIM (Processor-In-Memory) chips as smart-memory coprocessors to a
conventional microprocessor. These chips represent the first smart memory devices to support
virtual addressing and be capable of executing multiple threads of control. The DIVA PIM VLSI
is fabricated in TSMC 0.18-micron technology. The chip measures 9.8 mm on a side and
contains 55 million transistors.

The  successful  demonstration  of  the  DIVA  prototype  system  incorporating  this  chip  involved 
research  in  several  areas  including:  System  Architecture,  Software  System  Architecture,  PIM 
Architecture,  VLSI  Architecture/Implementation,  Emulator  Architecture/Design  and  the  actual 
development  of  the  prototype  system  hardware  and  software.  These  areas  involved  teams  made 
up of staff from USC/Information Sciences Institute, the University of Notre Dame, Caltech, the
University  of  Delaware  and  AlphaTech,  Inc. 

The  goals  of  the  DIVA  Project  were  to  demonstrate  the  capabilities  of  PIM  technology  as  smart 
memory  in  a  system: 

•  Exploit  the  inherent  memory  bandwidth 

o  embedded  DRAM  technology 

•  Cover  a  broad  range  of  applications: 

o  irregular  memory  accesses  (sparse-matrices  &  pointers) 
o  image  processing  and  multimedia  (streaming  computations) 

•  Evolutionary  application  migration  path 

o  PIMs  also  support  standard  memory  accesses 
o  familiar  parallel  programming  paradigm 

•  Prototype  a  workstation-class  system 

o VLSI PIM chips in standard memory modules

All these goals were met. The projected peak performance of a DIVA system with 32 PIMs is 40
GOPS, with an aggregate memory bandwidth of 160 Gbytes/second. This is an increase in
bandwidth of more than two orders of magnitude over conventional systems, meeting the DIS (Data
Intensive Systems) goal. A 35x speedup on the “cornerturn” benchmark, a matrix transpose
kernel function found in many data-intensive DoD applications, was also demonstrated.

The DIVA VLSI PIM developed under the DARPA Data Intensive Systems (DIS) Program is
proving to be effective in ameliorating the processor-memory bottleneck present in most of
today's computing systems. In addition, DIVA PIM technology has been incorporated into the
MONARCH (MOrphable Networked ARCHitecture) Project under the DARPA PCA
(Polymorphous Computer Architecture) Program and into Godiva, in partnership with Hewlett
Packard, on the HPCS (High Productivity Computing System) Program.

2.  Introduction 

The  increasing  gap  between  processor  and  memory  speeds  is  a  well-known  problem  in  computer 
architecture,  with  peak  processor  performance  increasing  at  a  rate  of  50-60%  per  year  while 
memory access times improve at merely 5-7%. Further, techniques designed to hide memory
latency, such as multithreading and prefetching, actually increase the memory bandwidth
requirements [Burger96]. Recent VLSI technology trends offer a promising solution to bridging
the processor-memory gap: embedded-DRAM technology integrates logic with high-density
memory  in  a  processing-in-memory  (PIM)  chip.  Because  PIM  internal  processors  can  be  directly 
connected  to  the  memory  banks,  the  memory  bandwidth  is  dramatically  increased  (with  hundreds 
of  gigabit/second  aggregate  bandwidth  available  on  a  chip  —  up  to  2  orders  of  magnitude  over 
conventional  DRAM).  Latency  to  on-chip  logic  is  also  reduced,  down  to  as  little  as  one  half  that 
of  a  conventional  memory  system,  because  internal  memory  accesses  avoid  the  delays  associated 
with  communicating  off  chip. 

The  system  described  in  this  report,  DIVA  (Data  Intensive  Architecture),  leverages  embedded- 
DRAM  technology  to  replace  or  augment  the  memory  system  of  a  conventional  workstation  with 
“smart  memories”  capable  of  very  large  amounts  of  processing.  System  bandwidth  limitations 
are  thus  overcome  in  three  ways:  (1)  tight  coupling  of  a  single  PIM  processor  with  an  on-chip 
memory  bank;  (2)  distributing  multiple  processor  memory  “nodes”  per  PIM  chip;  and,  (3) 
utilizing  a  separate  chip-to-chip  interconnect,  for  direct  communication  between  nodes  on 
different  chips  that  bypasses  the  host  system  bus.  The  DIVA  system  architecture  is  focused  on 
achieving  the  following  four  goals:  (1)  developing  PIMs  that  can  serve  as  the  only  memory  in  the 
system,  assuming  the  dual  roles  of  “smart  memories”  and  conventional  memory;  (2)  supporting  a 
wide  range  of  familiar  programming  paradigms,  closely  related  to  parallel  computing;  (3) 
targeting  applications  that  are  severely  impacted  by  the  processor-memory  bottlenecks  in 
conventional  systems:  sparse-matrix  and  pointer-based  applications  with  irregular  memory 
access  patterns,  and  image  and  video  applications  with  large  working  sets;  and,  (4)  developing  a 
VLSI  device  to  exploit  memory  and  communications  bandwidth  in  PIM-based  systems  while 
making  efficient  use  of  on-chip  resources  for  target  applications.  These  four  goals  distinguish 
DIVA  from  other  PIM-based  architectures. 

The  integration  into  a  conventional  system  affords  the  simultaneous  benefits  of  PIM  technology 
and  a  high-performance  microprocessor  host,  yielding  high  performance  for  mixed  workloads. 
Since  PIM  processors  are  usually  not  as  sophisticated  as  state-of-the-art  microprocessors  due  to 
on-chip  space  constraints,  systems  using  PIMs  alone  in  a  multiprocessor  may  sacrifice 
performance  on  uniprocessor  computations  [Saulsbury96][Kogge94],  while  SoC  (System-on-a- 
Chip)  solutions  (e.g.,  the  IRAM  [Patterson97]  and  the  Mitsubishi  M32R/D  [Mitsubishi99])  limit 
the  application  domain.  DIVA’s  support  for  a  broad  range  of  familiar  parallel  programming 
paradigms,  including  task  parallelism  for  irregular  computations,  distinguishes  it  from  systems 
with  restricted  applicability  (such  as  to  SIMD  parallelism  [Elliot99][Gokhale95][Patterson97]), 
as  well  as  systems  requiring  a  novel  programming  methodology  or  compiler  technology  to 
configure  logic  [Babb99],  or  to  manage  a  complex  memory,  computation  and  communication 
hierarchy  [Kang99].  DIVA’s  PIM-to-PIM  interconnect  improves  upon  approaches  that  serialize 
communication  through  the  host,  which  decreases  bandwidth  by  introducing  added  traffic  on  the 
processor  memory  bus  [Oskin98][Gokhale95]. 

A  major  challenge  in  meeting  the  above  four  goals  is  the  integrated  system  design,  which 
implements  the  system  architecture  and  spans  the  applications,  systems  software,  host-to- 
memory  interface,  memory-to-memory  interconnect,  PIM  software  and  embedded  DRAM  VLSI 
devices.

The remainder of this report is organized as follows. The next section summarizes the DIVA
system architecture, to set the context for the PIM microarchitecture and other sections that
follow. Section 4 describes the VLSI architecture and implementation in detail. Section 5
presents  the  compiler  optimization,  implementation  and  performance  results.  Section  6  describes 
the  DIVA  system  simulator  that  supported  the  applications  and  architectural  development 
throughout  the  DIVA  Project.  Section  7  sets  out  the  details  of  how  the  DIS  benchmarks  and 
stressmarks  as  well  as  other  application  code  were  used  with  the  simulator  to  evaluate  DIVA’s 
performance.  Section  8  summarizes  our  approach  to  an  FPGA  (Field  Programmable  Gate  Array) 
based  emulator  and  the  lessons  that  we  learned  in  this  endeavor.  Section  9  presents  the  system 
integration  that  was  required  to  produce  a  successful  system  prototype  demonstration  at  DARPA 
Tech  2002.  In  the  remaining  sections,  we  summarize  our  results,  technology  transfer, 
publications  and  conclusions. 

3.  System  Architecture 

A  driving  principle  of  the  DIVA  system  architecture  is  to  efficiently  utilize  PIM  technology  in  a 
way  that  requires  only  “evolutionary”  software  support.  This  principle  demands  an  approach  that 
enables  integration  of  PIM  features  into  conventional  systems  as  seamlessly  as  possible. 
Therefore,  DIVA  chips  will  be  packaged  as  conventional  memory  modules.  Inserted  onto  a 
conventional  microprocessor  motherboard,  the  memory  on  the  DIVA  chips  is  accessed  by  the 
host  microprocessor  as  if  it  were  conventional  memory. 

In Figure 1, we show a small set of PIMs connected to a single external host processor through a
host-memory interface. The PIM chips communicate through separate PIM-to-PIM channels.

Figure 1. DIVA system architecture

This  separate  memory-to-memory  interconnect  enables  communication  between  memories 
without  involving  the  host  processor. 

Spawning  computation,  gathering  results,  synchronizing  activity,  or  simply  accessing  non-local 
data  is  accomplished  via  parcels.  A  parcel  is  closely  related  to  an  active  message  as  it  is  a 
relatively  lightweight  communication  mechanism  containing  a  reference  to  a  function  to  be 
invoked  when  the  parcel  is  received  [vonEicken92].  Parcels  are  distinguished  from  active 
messages  in  that  the  destination  of  a  parcel  is  an  object  in  memory,  not  a  specific  processor. 
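
The parcel mechanism described above can be illustrated with a small model. This is a sketch under stated assumptions, not the DIVA parcel format: the field names, the handler table, and the address ranges are all hypothetical. It shows only the distinguishing property named in the text, that a parcel is routed by the address of its destination object rather than by a processor identifier.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class Parcel:
    """Hypothetical parcel: destination object, function reference, arguments."""
    dest_addr: int                 # address of the destination object in memory
    handler: str                   # reference to the function invoked on receipt
    args: Tuple[Any, ...] = ()


def read_word(node: "Node", addr: int) -> int:
    return node.memory.get(addr, 0)


def add_word(node: "Node", addr: int, value: int) -> int:
    node.memory[addr] = node.memory.get(addr, 0) + value
    return node.memory[addr]


# Hypothetical handler table shared by all nodes in this sketch.
HANDLERS = {"read": read_word, "add": add_word}


class Node:
    """Toy PIM node holding the objects in the address range [lo, hi)."""

    def __init__(self, lo: int, hi: int) -> None:
        self.lo, self.hi = lo, hi
        self.memory: Dict[int, int] = {}

    def holds(self, addr: int) -> bool:
        return self.lo <= addr < self.hi

    def receive(self, parcel: Parcel) -> Any:
        # Invoke the referenced function on the object the parcel names.
        return HANDLERS[parcel.handler](self, parcel.dest_addr, *parcel.args)


def deliver(nodes: List[Node], parcel: Parcel) -> Any:
    """Route a parcel to whichever node holds the destination object."""
    for node in nodes:
        if node.holds(parcel.dest_addr):
            return node.receive(parcel)
    raise LookupError("no node holds the destination object")


nodes = [Node(0x0000, 0x1000), Node(0x1000, 0x2000)]
deliver(nodes, Parcel(0x1004, "add", (5,)))    # executes on the second node
value = deliver(nodes, Parcel(0x1004, "read"))
```

The sender never names a node; moving the object to a different node would change only which `holds` test succeeds.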

Parcels  are  transmitted  through  a  separate  PIM-to-PIM  interconnect  to  enable  communication 
without  interfering  with  host-memory  traffic.  This  interconnect  must  be  amenable  to  the  dense 
packing  requirement  of  memory  devices  and  allow  the  addition  or  removal  of  devices  from  the 
system. For system sizes of the scale expected for DIVA (on the order of 32 PIM chips), this
combination of requirements favors a one-dimensional network [Kang00]. Future generations of
DIVA-like systems that contain large numbers of PIM chips will require a more complex
interconnection network and are the topic of future research.

Parcels, application code, and data contain virtual addresses. To translate these addresses without
the overhead of maintaining conventional page tables at each node, we classify DIVA memory
according to usage [Hall99]: (1) global memory visible to the host and PIM nodes; (2) dumb
memory allocated as conventional pages in a host application's virtual space and untouched by
PIM node processing; and, (3) local memory used exclusively by PIM node routines. To
condense translation information, rather than page tables, we use segments, each of which is
defined by segment registers which are used by the node address translation unit as discussed
below.

The primary functions of the node address translation unit are to translate virtual addresses to
physical addresses for those accesses that are locally resident, and to provide access
protection. The types of accesses generated by a DIVA PIM processor that require translation
include instruction fetches and data accesses to memory or memory-mapped devices such as
parcel buffers, generated by load or store instructions.

Given the simplicity of the address translation scheme, very little hardware support is needed to
effect efficient translation. A segment base address register and a limit register are needed for each
of the eight local segments. Also, one virtual base, limit, and physical base register are needed
for each resident global segment. The initial DIVA architecture provides four sets of global
segment registers, although alternative architectures could provide more. The address translation
unit contains no direct support for home node translation, although the preferred system
programming is such that the global segments resident on a node form the portion of global
memory for which that node is the home node. If this is not the case, address faults invoke
system software, which performs the home node translation.
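
The global-segment lookup described above can be sketched as a short model. It assumes that the limit register holds the segment length in bytes and that a miss in all resident segments raises an address fault to system software; the class and function names are illustrative and do not come from the DIVA design.

```python
from dataclasses import dataclass
from typing import List

MAX_GLOBAL_SEGMENTS = 4  # the initial DIVA architecture provides four register sets


class AddressFault(Exception):
    """No resident global segment covers the address; system software takes over."""


@dataclass
class GlobalSegment:
    virtual_base: int
    limit: int           # assumed here to be the segment length in bytes
    physical_base: int


def translate_global(vaddr: int, segments: List[GlobalSegment]) -> int:
    """Translate a global virtual address using the resident segment registers."""
    assert len(segments) <= MAX_GLOBAL_SEGMENTS
    for seg in segments:
        offset = vaddr - seg.virtual_base
        if 0 <= offset < seg.limit:
            return seg.physical_base + offset
    raise AddressFault(hex(vaddr))


resident = [GlobalSegment(0x40000000, 0x1000, 0x00080000)]
paddr = translate_global(0x40000010, resident)
```

Because each segment needs only three registers, the translation state per node stays fixed no matter how large the segments are.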

In addition to local segments, a node maintains translation information for its portion of global
memory. Remote addresses are translated via the concept of a home node, which is guaranteed to
have the translation [Saulsbury95]. Thus, each node's portion of global memory includes objects
for which it is the home node. The major advantages of this approach are that translation may be
accomplished rapidly, and translation information on each PIM scales well.

Memory management functionality is distributed among the host's standard operating system,
augmented with support for PIMs, and run-time kernels on each PIM processor. Unlike standard
multiprocessor systems, the host, which has a system-level view, remains a central figure in
system-level scheduling, disk I/O operations, and memory management. The PIM run-time
kernel must collaborate with the host on system-level operations, such as loading PIM programs
and data, memory management of PIM-visible segments, and PIM context switches between
different user programs. The challenge in this collaboration is that there are really two views of
memory that must be maintained. For dumb pages and for disk I/O of PIM-visible segments, the
host sees memory as standard 4-Kbyte pages; the PIM run-time kernel instead views PIM-visible
memory as variable-sized segments [Hall00].

4. VLSI Architecture and Implementation

The goal of the VLSI development on the DIVA project was to produce a prototype chip that
demonstrated  the  enormous  bandwidth  available  between  memory  blocks  and  processing 
subcomponents  on  a  processing-in-memory  (PIM)  device.  As  the  following  sections  discuss,  the 
DIVA  project  was  very  successful  with  its  VLSI  demonstrations  and  was  the  first  effort  under 
the  Data-Intensive  Systems  (DIS)  program  to  deliver  working  silicon.  The  bulk  of  this  effort  can 
be  categorized  into  chip-level  architecture  research  and  VLSI  implementation. 

4.1  PIM  Chip  Architecture 

Each  DIVA  PIM  chip  is  a  VLSI  memory  device  augmented  with  general-purpose  computing  and 
networking/communication hardware. Although a PIM may consist of multiple nodes, each of
which is primarily composed of a few megabytes of memory and a node processor, Figure 2
shows a PIM with a single node, which reflects the focus of the research that was conducted on
the DIVA project. Nodes on a PIM chip share a single PIM Routing Component (PiRC) and a
host  interface.  The  PiRC  is  responsible  for  routing  parcels  on  and  off  chip.  The  host  interface 
implements  the  JEDEC  standard  SDRAM  (Synchronous  Dynamic  Random  Access  Memory) 
protocol  so  that  memory  accesses  as  well  as  parcel  activity  initiated  by  the  host  appear  as 
conventional  memory  accesses  from  the  host  perspective.  More  details  of  the  PiRC  can  be  found 
in [Kang00] and more information on the host interface is given in [Draper02a].

Figure  2  also  shows  two  interconnects  that  span  a  PIM  chip  for  information  flow  between  nodes, 
the  host  interface,  and  the  PiRC.  Each  interconnect  is  distinguished  by  the  type  of  information  it 
carries.  The  PIM  memory  bus  is  used  for  conventional  memory  accesses  from  the  host  processor. 
The  parcel  interconnect  allows  parcels  to  transit  between  the  host  interface,  the  nodes,  and  the 
PiRC.  Within  the  host  interface,  a  parcel  buffer  (PBUF)  is  a  buffer  that  is  memory-mapped  into 
the  host  processor's  address  space,  permitting  application-level  communication  through  parcels. 
Each  PIM  node  also  has  a  PBUF,  memory-mapped  into  the  node's  local  address  space.  More 
information on the PBUF design is found in Appendix A2: DIVA Node Architecture manual.

Figure 2. DIVA PIM chip organization


Figure  3  shows  the  major  control  and  data  connections  within  a  node,  with  the  256-bit  memory 
data  bus  as  the  centerpiece.  The  DIVA  PIM  node  processing  logic  supports  single-issue,  in-order 
execution,  with  32-bit  instructions  and  32-bit  addresses.  There  are  two  datapaths  whose  actions 
are coordinated by a single execution control unit: a scalar datapath that performs sequential
operations on 32-bit operands, and a WideWord datapath that performs fine-grain parallel
operations  on  256-bit  operands.  Both  datapaths  execute  from  a  single  instruction  stream  under 
the  control  of  a  single  5-stage  DLX  (Deluxe)-like  pipeline.  The  instruction  set  has  been  designed 
so  both  datapaths  can,  for  the  most  part,  use  the  same  opcodes  and  condition  codes,  generating  a 
large  functional  overlap. 
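
To make the fine-grain parallel operation concrete, the sketch below models a WideWord add on 256-bit operands in which each subfield is added independently, with no carry crossing a subfield boundary. The subfield width parameter, the wraparound behavior on overflow, and the helper names are assumptions for illustration; this section does not specify them.

```python
WIDE_BITS = 256  # width of a WideWord operand


def pack(fields, field_bits=32):
    """Pack subfield values into one wide value, element 0 in the low-order bits."""
    mask = (1 << field_bits) - 1
    value = 0
    for i, f in enumerate(fields):
        value |= (f & mask) << (i * field_bits)
    return value


def wide_add(a, b, field_bits=32):
    """Add two 256-bit values subfield by subfield, wrapping within each subfield."""
    assert WIDE_BITS % field_bits == 0
    mask = (1 << field_bits) - 1
    result = 0
    for shift in range(0, WIDE_BITS, field_bits):
        total = ((a >> shift) & mask) + ((b >> shift) & mask)
        result |= (total & mask) << shift  # drop the carry out of the subfield
    return result


a = pack([1, 2, 3, 4, 5, 6, 7, 0xFFFFFFFF])
b = pack([10, 10, 10, 10, 10, 10, 10, 1])
total = wide_add(a, b)  # eight 32-bit additions in one operation
```

With `field_bits=8` the same call performs 32 independent byte additions, which is the kind of functional overlap with the scalar opcodes that the text describes.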


Figure 3. DIVA PIM node architecture

Each datapath has its own independent general-purpose register file, 32 32-bit registers for the
scalar  datapath  and  32  256-bit  registers  for  the  WideWord  datapath,  but  special  instructions 
permit  direct  transfers  between  datapaths  without  going  through  memory.  Although  not 
supported  in  the  initial  DIVA  prototype,  floating-point  extensions  to  the  WideWord  datapath  will 
be  provided  in  future  implementations.  The  memory  arbiter/controller  is  responsible  for 
generating  proper  control  signals  to  the  memory  macro.  Its  functions  include  initiating  refresh 
cycles  as  needed  and  arbitrating  between  the  host  memory  port  and  the  execution  control  unit  for 
access  to  the  memory  macro.  Furthermore,  it  tracks  and  maintains  an  open  row  in  the  DRAM 
macro  to  enable  page-mode  accesses  as  often  as  possible.  Another  key  component  of  each  PIM 
node  is  an  instruction  cache,  which  was  included  in  the  DIVA  design  to  keep  instruction  accesses 
to  the  memory  macro  from  interfering  with  data  accesses  as  much  as  possible.  Each  node  also 
contains  a  parcel  buffer  (PBUF),  as  described  earlier.  The  following  sections  briefly  discuss  the 
scalar  and  WideWord  subcomponents,  highlighting  some  of  the  more  notable  features.  More 
detail  on  these  microarchitectures  as  well  as  those  of  other  subcomponents  of  the  DIVA  PIM 
chip  can  be  found  in  the  Appendices. 

4.1.1  Microarchitecture:  The  Scalar  Processor 

As  noted  earlier,  the  combination  of  the  execution  control  unit  and  scalar  datapath  is  a  standard 
RISC  processor  and  serves  as  the  DIVA  scalar  processor,  or  microcontroller.  It  coordinates  all 
activity  within  a  DIVA  PIM  node.  This  section  details  the  microarchitecture  of  this  component 
by  first  presenting  an  overview  of  the  instruction  set  architecture,  followed  by  a  description  of 
the  pipeline  and  discussion  of  special  features.  More  detail  of  the  instruction  set  can  be  found  in 
Appendix A1: DIVA Instruction Set Manual.

Instruction  set  architecture  overview 


Much like the Hennessy and Patterson DLX architecture, most DIVA scalar instructions use a
three-operand format to specify two source registers and a destination register, as shown in
Figure 4. For these types of instructions, the opcode generally denotes a class of operations, such
as arithmetic, and the function denotes a specific operation, such as add. The C bit indicates
whether the operation performed by the instruction execution updates condition codes. In lieu of
a second source register, a 16-bit immediate value may be specified, as shown in Figure 5. The
scalar instruction set includes the typical arithmetic functions add, subtract, multiply, and divide;
logical functions AND, OR, NOT, and XOR; and logical/arithmetic shift operations. In addition,
there are a number of special instructions, described in the Special Features section below.
Load/store instructions adhere to the immediate format, where the address for the memory
operation is formed by the addition of an immediate value to the contents of rA, which serves as
a base address. The DIVA scalar processor does not support a base-plus-register addressing
mode because such a mode requires an extra read port on the register file for store operations.

Figure 5. Scalar immediate instruction format

Branch instructions use a different format. The branch target address may be PC-relative, useful
for relocatable code, or calculated using a base register combined with an offset, useful with
table-based branch targets. In both formats, the offset is in units of instruction words, or 4 bytes.
By specifying the offset in instruction words, rather than bytes, a larger branch window results.
To support function calls, the branch instruction format also includes a bit for specifying linkage,
that is, whether a return instruction address should be saved in R31. The branch format also
includes a 3-bit condition field to specify one of eight branch conditions: always, equal, not
equal, less than, less than or equal, greater than, greater than or equal, or overflow.
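
The two address calculations described above, the load/store effective address and the branch target, can be sketched as follows. This is an illustrative model only: sign extension of the 16-bit immediate, the choice of base value for the PC-relative form, the offset field width, and all function names are assumptions that this section does not specify.

```python
MASK32 = 0xFFFFFFFF      # DIVA node addresses are 32 bits wide
INSTRUCTION_BYTES = 4    # branch offsets count instruction words


def sign_extend16(imm):
    """Interpret a 16-bit immediate as signed (an assumption of this sketch)."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def effective_address(regs, rA, imm16):
    """Load/store address: contents of base register rA plus the immediate."""
    return (regs[rA] + sign_extend16(imm16)) & MASK32


def branch_target(base, word_offset):
    """Branch target: base (the PC or a base register value) plus offset words."""
    return (base + word_offset * INSTRUCTION_BYTES) & MASK32


def forward_reach_bytes(offset_bits):
    """Largest forward displacement of a signed word-offset field of this width."""
    return ((1 << (offset_bits - 1)) - 1) * INSTRUCTION_BYTES


regs = [0] * 32
regs[4] = 0x1000
load_addr = effective_address(regs, 4, 0xFFFC)  # base minus 4 bytes
target = branch_target(0x2000, 3)               # three instruction words ahead
```

Counting the offset in words rather than bytes multiplies the reach of a given offset field by four, which is the larger branch window mentioned above.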

Pipeline  description  and  associated  hazards 

A  high-level  schematic  of  the  pipeline  execution  control  unit  and  scalar  datapath  is  shown  in 
Figure 6. The pipeline is a standard DLX-like 5-stage pipeline, with the following stages: (1)
instruction  fetch;  (2)  decode  and  register  read;  (3)  execute;  (4)  memory;  and,  (5)  write-back. 
Figure  6  indicates  these  five  stages  with  respect  to  the  data-path  registers  and  also  indicates  the 
write-back  and  bypass  datapaths.  The  pipeline  controller  contains  the  necessary  logic  to  handle 
data, control, and structural hazards. Data hazards occur when there are read-after-write register
dependences between instructions that co-exist in the pipeline. The controller and datapath
contain  the  necessary  forwarding,  or  bypass,  logic  to  allow  pipeline  execution  to  proceed  without 
stalling  in  most  data  dependence  cases.  The  only  exception  to  this  generality  involves  the  load 
instruction,  where  a  "bubble"  must  be  inserted  between  the  load  instruction  and  an  immediately 
following  instruction  that  uses  the  load  target  register  as  one  of  its  source  operands. 
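
The load-use exception described above can be expressed as a simple check over an instruction sequence. The instruction representation below is hypothetical and models only register names; it counts one bubble whenever a load is immediately followed by an instruction that reads the load's target register, and treats every other read-after-write dependence as covered by forwarding.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str                  # for example "load", "add", "sub"
    dest: Optional[int]      # destination register number, or None
    srcs: Tuple[int, ...]    # source register numbers


def count_load_use_bubbles(program: List[Instr]) -> int:
    """Count bubbles inserted for load results used by the very next instruction."""
    bubbles = 0
    for prev, nxt in zip(program, program[1:]):
        if prev.op == "load" and prev.dest is not None and prev.dest in nxt.srcs:
            bubbles += 1
    return bubbles


program = [
    Instr("load", 1, (2,)),    # r1 <- mem[r2 + imm]
    Instr("add", 3, (1, 4)),   # uses r1 immediately: one bubble
    Instr("load", 5, (2,)),    # r5 <- mem[r2 + imm]
    Instr("sub", 6, (7, 8)),   # independent instruction fills the slot
    Instr("add", 9, (5, 6)),   # r5 and r6 arrive through forwarding
]
bubbles = count_load_use_bubbles(program)
```

A compiler or assembly programmer can remove the bubble by scheduling an independent instruction after each load, as the second load in the example does.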

Control  hazards  occur  for  branch  instructions.  Unlike  the  DLX  architecture,  which  uses  explicit 
comparison  instructions  and  testing  of  a  general-purpose  register  value  for  branching  decisions, 
the  DIVA  design  incorporates  condition  codes  that  may  be  updated  by  most  arithmetic/logical 
instructions.  The  condition  codes  used  for  branching  decisions  are: 

•  EQ  -  set  if  the  result  is  zero 

•  LT  -  set  if  the  result  is  negative 

•  GT  -  set  if  the  result  is  positive 

•  OV  -  set  if  the  operation  overflows 
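
As an illustration of how these four codes could drive the eight branch conditions listed earlier, the sketch below computes the codes for a 32-bit signed addition and evaluates each condition from them. It assumes that EQ, LT, and GT reflect the wrapped 32-bit result and that each compound condition is the obvious combination of codes; the hardware's exact definitions may differ.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Reinterpret the low 32 bits of x as a two's-complement value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def add_with_flags(a, b):
    """Return the 32-bit sum and the condition codes it would set."""
    exact = to_signed32(a) + to_signed32(b)
    result = to_signed32(exact)
    flags = {
        "EQ": result == 0,
        "LT": result < 0,
        "GT": result > 0,
        "OV": exact != result,   # the exact sum does not fit in 32 bits
    }
    return result & MASK32, flags


# Assumed mapping from branch condition to condition codes.
BRANCH_CONDITIONS = {
    "always": lambda f: True,
    "equal": lambda f: f["EQ"],
    "not equal": lambda f: not f["EQ"],
    "less than": lambda f: f["LT"],
    "less than or equal": lambda f: f["LT"] or f["EQ"],
    "greater than": lambda f: f["GT"],
    "greater than or equal": lambda f: f["GT"] or f["EQ"],
    "overflow": lambda f: f["OV"],
}

result, flags = add_with_flags(3, 0xFFFFFFFD)   # 3 + (-3)
taken = BRANCH_CONDITIONS["less than or equal"](flags)
```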

The DIVA pipeline design imposes a single branch delay slot, so that the instruction following a
branch instruction is always executed. Since branches are always resolved within the second
stage of the pipeline, no stalls or bubbles are associated with branch instructions.

Since the general-purpose register file contains two read ports and one write port, it may sustain two
operand reads and one result write every clock cycle; thus, the register file design introduces no
structural hazards. The only structural hazard that impacts the pipeline operation is the node
memory. Pipeline stalls may occur in the instruction fetch stage if an instruction cache miss
occurs. The pipeline will resume once the cache fill memory request has been satisfied. Likewise,
stalls occur any time a load/store instruction reaches the memory stage of the pipeline until the
memory operation is completed.

Figure 6. Scalar datapath and pipeline stages


Special  features 

The  novelty  of  the  DIVA  scalar  processor  lies  in  the  special  features  that  support  DIVA-specific 
functions.  Although  by  no  means  exhaustive,  this  section  highlights  some  of  the  more  notable 
capabilities. 

Run-time  Kernel  Support 

The  execution  control  unit  supports  supervisor  and  user  modes  of  processing  and  also  maintains 
a  number  of  special-purpose  and  protected  registers  for  support  of  exception  handling,  address 
translation,  and  general  OS  (Operating  System)  services.  Exceptions,  arising  from  execution  of 
node  instructions,  and  interrupts,  from  other  sources  such  as  an  internal  timer  or  external 
component  like  the  PBUF,  are  handled  by  a  common  mechanism. 

The  exception-handling  scheme  for  DIVA  has  a  modest  hardware  requirement,  exporting  much 
of the complexity to software, to maintain a flexible implementation platform. It provides an
integrated mechanism for handling hardware and software exception sources and a flexible
priority  assignment  scheme  that  minimizes  the  amount  of  time  that  exception  recognition  is 
disabled.  While  the  hardware  design  allows  traditional  stack-based  exception  handlers,  it  also 
supports  a  non-recursive  dispatching  scheme  that  uses  DIVA  hardware  features  to  allow 
preemption  of  lower  priority  exception  handlers. 

The  impact  of  run-time  kernel  support  on  the  scalar  processor  design  is  the  addition  of  a  modest 
number  of  special-purpose  and  protected  (or  supervisor-level)  registers  and  a  non-negligible 
amount  of  complexity  added  to  the  pipeline  control  for  entering/exiting  exception  handling 
modes  cleanly.  When  the  scalar  processor  control  unit  detects  an  exception,  the  logic  performs  a 
number  of  tasks  within  a  single  clock  cycle  to  prepare  the  processor  for  entering  an  exception 
handler  in  the  next  clock  cycle. 

Those  tasks  include: 

•  determining  which  exception  to  handle  by  prioritizing  among  simultaneously  occurring 
exceptions, 

•  setting  up  shadow  registers  to  capture  critical  state  information,  such  as  the  processor 
status  word  register,  the  instruction  address  of  the  faulting  instruction,  the  memory 
address if the exception is an address fault, etc.,

•  configuring  the  program  counter  logic  to  load  an  exception  handler  address  on  the  next 
clock  cycle,  and 

•  setting  up  the  processor  status  word  register  to  enter  supervisor  mode  with  exception 
handling  temporarily  disabled. 

Once  invoked,  the  exception  handler  first  stores  other  pieces  of  user  state  and  interrogates 
various  pieces  of  state  hardware  to  determine  how  to  proceed.  Once  the  exception  handler 
routine  has  completed,  it  restores  user  state  and  then  executes  a  return-from-exception  instruction, 
which  copies  the  shadow  register  contents  back  into  various  state  registers  to  resume  processing 
at  the  point  before  the  exception  was  encountered.  If  it  is  impossible  to  resume  previous 
processing  due  to  a  fatal  exception,  the  run-time  kernel  exception  handler  may  choose  to 
terminate  the  offending  process. 
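
The entry and return sequence described above can be summarized in a small state model. The exception names, priority order, and handler addresses below are hypothetical placeholders, and the model collapses the processor status word into two flags; it is meant only to show the order of operations, not the DIVA register set.

```python
from dataclasses import dataclass, field

# Hypothetical priorities (lower number wins) and handler entry addresses.
PRIORITY = {"address_fault": 0, "timer": 1, "pbuf": 2}
HANDLER = {"address_fault": 0x100, "timer": 0x140, "pbuf": 0x180}


@dataclass
class Cpu:
    pc: int = 0
    supervisor: bool = False
    exceptions_enabled: bool = True
    shadow: dict = field(default_factory=dict)


def enter_exception(cpu, pending, fault_addr=None):
    """Model the single-cycle preparation for entering an exception handler."""
    cause = min(pending, key=PRIORITY.__getitem__)   # prioritize among pending
    cpu.shadow = {                                   # capture critical state
        "pc": cpu.pc,
        "supervisor": cpu.supervisor,
        "exceptions_enabled": cpu.exceptions_enabled,
        "fault_addr": fault_addr,
    }
    cpu.pc = HANDLER[cause]                          # handler address loads next
    cpu.supervisor = True                            # enter supervisor mode
    cpu.exceptions_enabled = False                   # recognition disabled for now
    return cause


def return_from_exception(cpu):
    """Copy the shadow contents back to resume at the interrupted point."""
    cpu.pc = cpu.shadow["pc"]
    cpu.supervisor = cpu.shadow["supervisor"]
    cpu.exceptions_enabled = cpu.shadow["exceptions_enabled"]


cpu = Cpu(pc=0x2000)
cause = enter_exception(cpu, {"timer", "address_fault"}, fault_addr=0x40001000)
return_from_exception(cpu)
```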

Interaction  with  the  WideWord Datapath 

There  are  a  number  of  features  in  the  scalar  processor  design  involving  communication  with  the 
WideWord  datapath  that  greatly  enhance  performance.  The  path  to/from  the  WideWord  datapath 
in  the  execute  stage  of  the  pipeline  facilitates  the  exchange  of  data  between  the  scalar  and 
WideWord  datapaths  without  going  through  memory.  This  capability  distinguishes  DIVA  from 
other  architectures  containing  vector  units,  such  as  AltiVec.  This  path  also  allows  scalar  register 
values  to  be  used  to  specify  WideWord  functions,  such  as  indices  for  selecting  subfields  within 
WideWords  and  indices  into  permutation  look-up  tables.  Instead  of  requiring  an  immediate  value 
within  a  WideWord  instruction  for  specifying  such  indices,  this  register-based  indexing 
capability  enables  more  intelligent,  efficient  code  design. 

There are also a couple of instructions that are especially useful for enabling efficient data
mining operations. ELO, encode leftmost one, and CLO, clear leftmost one, are instructions that
generate a 5-bit index corresponding to the bit position of the leftmost one in a 32-bit value and
clear the leftmost one in a 32-bit value, respectively. These instructions are especially useful for
examining the 32-bit WideWord condition code register values, which may be transferred to
scalar general-purpose registers to perform such tests. For instance, with this capability, finding
and processing data items that match a specified key is accomplished in far fewer
instructions than the sequence of bit masking and shifting involved in 32 bit tests, which is
required with conventional processor architectures.
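
A software model of the two instructions shows why they shorten such searches. The sketch assumes that bit positions are numbered from 0 at the least significant bit to 31 at the leftmost bit, and that ELO applied to zero yields no index; neither detail is specified in this section, and the function names simply mirror the mnemonics.

```python
MASK32 = 0xFFFFFFFF


def elo(x):
    """Encode leftmost one: index of the most significant set bit, or None for zero."""
    x &= MASK32
    return x.bit_length() - 1 if x else None


def clo(x):
    """Clear leftmost one: x with its most significant set bit cleared."""
    x &= MASK32
    return x & ~(1 << (x.bit_length() - 1)) if x else 0


def matching_subfields(condition_codes):
    """Visit only the set bits of a 32-bit condition code value, leftmost first."""
    hits = []
    while condition_codes:
        hits.append(elo(condition_codes))
        condition_codes = clo(condition_codes)
    return hits


hits = matching_subfields(0x80000005)   # bits 31, 2, and 0 are set
```

The loop body runs once per match rather than once per bit, so a value with three set bits costs three iterations instead of 32 mask-and-shift tests.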

There are some variations of the branch/call instructions that also interact with the WideWord
datapath. The BA (branch on all) instruction specifies that a branch is to be taken if the status
of condition codes within every subfield of the WideWord datapath matches the condition specified
in the BA instruction. The BN (branch on none) instruction specifies that a branch is to be taken
if the status of condition codes within no subfield of the WideWord datapath matches the
condition specified in the BN instruction. With proper code structuring around these
instructions, inverse forms of these branches, such as branch on any or branch on not all, can
also be effected.
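
The relationship between BA, BN, and their inverse forms can be modeled with a short Python sketch. The function names are hypothetical, and each subfield's condition match is represented as a boolean, which is a simplification of the hardware condition codes.

```python
def branch_on_all(matches):
    """BA: taken when every subfield matches the condition."""
    return all(matches)

def branch_on_none(matches):
    """BN: taken when no subfield matches the condition."""
    return not any(matches)

def branch_on_any(matches):
    """Inverse of BN: place the 'any' code on the BN fall-through path."""
    return not branch_on_none(matches)

def branch_on_not_all(matches):
    """Inverse of BA: place the 'not all' code on the BA fall-through path."""
    return not branch_on_all(matches)
```

In assembly terms, "branch on any" corresponds to a BN that jumps over the code to be run when at least one subfield matches.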

Miscellaneous  Instructions 

There are also several other miscellaneous instructions that add some complexity to the
processor design. The probe instruction allows a user to interrogate the address translation
logic to see if a global address is locally mapped. This capability allows users who wish to
optimize code for performance to avoid slow, overhead-laden address translation exceptions. Also,
an instruction cache invalidate instruction allows the supervisor kernel to evict user code from
the cache without invalidating the entire cache and is useful in process termination cleanup
procedures. Lastly, there are versions of load/store instructions that "lock" memory operations,
which are useful for implementing synchronization functions, such as semaphores or barriers.

4.1.2  Microarchitecture:  The  WideWord  Processor 

The combination of the execution control unit and WideWord datapath is regarded as the
WideWord Processor. This component enables superword-level parallelism on wide words of
256 bits, similar to multimedia extensions such as MMX and AltiVec. This fine-grain parallelism
offers additional opportunity for exploiting the increased processor-memory bandwidth available
in a PIM. Selective execution, direct transfers to/from other register files, integration with
communication, as well as the ability to access main memory at very low latency, distinguish the
DIVA WideWord capabilities from MMX and AltiVec. This section details the microarchitecture of
this component by first presenting an overview of the instruction set architecture, followed by
a brief description of the pipeline. More detail can be found in [Draper02a].

WideWord Instruction set architecture

Figure 7. WideWord instruction format


As  shown  in  Figure  7,  most  DIVA  WideWord  instructions  use  a  three-operand  format  to  specify 
two  256-bit  source  registers  and  a  256-bit  destination  register.  The  opcode  generally  denotes  a 
class  of  operations,  such  as  arithmetic,  and  the  function  denotes  a  specific  operation,  such  as  add 
or  subtract.  The  C  bit  indicates  whether  the  operation  performed  by  the  instruction  execution 
updates  condition  codes.  The  W  field  indicates  the  operand  width,  allowing  WideWord  data  to 
be  treated  as  a  packed  array  of  objects  of  eight,  sixteen,  or  thirty-two  bits  in  size.  This 
characteristic  means  the  WideWord  ALU  (Arithmetic  Logic  Unit)  can  be  represented  as  a 
number  of  variable-width  parallel  ALUs.  The  P  field  indicates  the  participation  mode,  a  form  of 
selective  subfield  execution  that  depends  on  the  state  of  local  and  neighboring  condition  codes. 
Under  selective  execution,  only  the  results  corresponding  to  the  subfields  that  participate  in  the 
computation  are  written  back,  or  committed,  to  the  instruction's  destination  register.  The 
subfields  that  participate  in  the  conditional  execution  of  a  given  instruction  are  derived  from  the 
condition  codes  or  a  mask  register,  plus  the  instruction's  2-bit  participation  field. 
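
A minimal Python sketch of selective execution on packed subfields follows. It is not the DIVA datapath: the helper names are hypothetical, the participation decision is passed in directly as one boolean per subfield (the encoding of the 2-bit P field is not given in the text), and wrap-around addition without saturation is assumed.

```python
WIDE_BITS = 256

def pack(fields, width):
    """Pack a list of subfield values (subfield 0 in the low bits) into one integer."""
    mask = (1 << width) - 1
    value = 0
    for i, field in enumerate(fields):
        value |= (field & mask) << (i * width)
    return value

def unpack(value, width):
    """Split a 256-bit integer into WIDE_BITS // width subfields."""
    mask = (1 << width) - 1
    return [(value >> (i * width)) & mask for i in range(WIDE_BITS // width)]

def packed_add(a, b, dest, width, participate):
    """Add a and b subfield by subfield; only participating subfields are
    committed, the others keep the old destination contents."""
    assert width in (8, 16, 32)
    count = WIDE_BITS // width
    assert len(participate) == count
    mask = (1 << width) - 1
    result = 0
    for i in range(count):
        shift = i * width
        if participate[i]:
            field = ((a >> shift) + (b >> shift)) & mask
        else:
            field = (dest >> shift) & mask
        result |= field << shift
    return result
```

The same routine handles thirty-two 8-bit, sixteen 16-bit, or eight 32-bit subfields, mirroring the role of the W field.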

The  WideWord  instruction  set  consists  of  roughly  30  instructions  implementing  typical 
arithmetic  instructions  like  add,  subtract,  and  multiply;  logical  functions  like  AND,  OR,  NOT, 
XOR;  and  logical/arithmetic  shift  operations.  In  addition,  there  are  load/store  and  transfer 
instructions  that  provide  for  rich  interactions  between  the  scalar  and  WideWord  datapaths. 

Some  special  instructions  include  permutation,  merge,  and  pack/unpack.  The  WideWord 
permutation  network  supports  fast  alignment  and  reorganization  of  data  in  wide  registers.  The 
permutation  network  enables  any  8-bit  data  field  of  the  source  register  to  be  moved  into  any  8-bit 
data  field  of  the  destination  register.  A  permutation  is  specified  by  a  permutation  vector,  which 
contains  32  indices  corresponding  to  the  32  8-bit  subfields  of  a  WideWord  destination  register.  A 
WideWord  permutation  instruction  selects  a  permutation  vector  by  either  specifying  an  index 
into  a  small  set  of  hard-wired  commonly  used  permutations  or  a  WideWord  register  whose 
contents  are  the  desired  permutation  vector.  The  merge  instruction  allows  a  WideWord 
destination  to  be  constructed  from  the  intermixing  of  subfields  from  two  source  operands,  where 
the  source  for  each  destination  subfield  is  selected  by  a  condition  specified  in  the  instruction. 
This  merge  instruction  effects  efficient  sorting.  The  pack/unpack  instructions  allow  the 
truncation/elevation  of  data  types  and  are  especially  useful  in  pixel  processing. 
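
The permutation and merge operations can be sketched in Python on a 32-byte register image. The function names are hypothetical, and the merge condition is modeled as one precomputed boolean per subfield rather than as an instruction-encoded condition.

```python
def permute(src, perm):
    """dest[i] = src[perm[i]] for each of the 32 byte-wide destination subfields."""
    assert len(src) == 32 and len(perm) == 32
    return bytes(src[p] for p in perm)

def merge(a, b, select_a):
    """Build the destination from a where select_a is true, otherwise from b."""
    return bytes(x if s else y for x, y, s in zip(a, b, select_a))

def compare_exchange(a, b):
    """One sorting-network step: per-subfield minimum and maximum via two merges."""
    a_is_low = [x <= y for x, y in zip(a, b)]
    lows = merge(a, b, a_is_low)
    highs = merge(b, a, a_is_low)
    return lows, highs
```

The compare-exchange step suggests why a conditional merge supports sorting: two merges on the same condition split a pair of registers into per-subfield minima and maxima.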

Pipeline  description 

Identical  to  and  tightly  integrated  with  the  scalar  pipeline,  the  pipeline  of  the  WideWord  datapath 
is  a  standard  DLX-like  5-stage  pipeline,  with  the  following  stages:  (1)  instruction  fetch;  (2)
decode  and  register  read;  (3)  execute;  (4)  memory;  and,  (5)  writeback.  Data  hazards  occur  when 
there  are  read-after-write  register  dependences  between  instructions  that  co-exist  in  the  pipeline. 
The  controller  and  datapath  contain  the  necessary  forwarding,  or  bypass,  logic  to  allow  pipeline 
execution  to  proceed  without  stalling  in  most  data  dependence  cases.  Register  forwarding  is 
complicated  somewhat  by  the  participation  capability.  Participation  status  must  be  forwarded 
along  with  each  subfield  to  effect  correct  forwarding. 

4.2  VLSI  Development 

From a host of potential foundries for fabrication, the selections were quickly narrowed down to
two possible embedded DRAM candidates early in the DIVA project: IBM and TSMC. IBM clearly had
more experience in the embedded DRAM arena, so early efforts in the DIVA VLSI development task
targeted the IBM CMOS7LD 0.25µm embedded DRAM process, and a scalar processor test chip was
fabricated in HP CMOS14 0.5µm technology through MOSIS. (The HP process was used for early
prototyping because its logic speed matched that of the IBM process, and prototypes could be
built very cheaply through this route.) A test vehicle on the TSMC 0.25µm process was also
fabricated to gain familiarity with that technology. Although the DIVA team entered into a
research collaboration contract with the Blue Gene team at IBM Watson, the DIVA project was not
granted access to IBM fabrication capability in a timely manner. Therefore, in the second half of
the project, the VLSI development for the integrated PIM prototype targeted the TSMC 0.18µm
process. This process was introduced with an embedded DRAM capability, but that capability was
later phased out, so the DIVA prototype PIM was fabricated with SRAM (Static Random Access
Memory) as a placeholder for embedded DRAM.

Figure 8, Prototype PIM signal summary

As part of the core VLSI development task, a new CAD tool flow was installed. To accommodate
rapid design of the PIM chip, we relied heavily on the ability to specify the chip design with
RTL-level VHDL and synthesize this description into a gate-level netlist of standard cells. The
VHDL was optimized and synthesized using Synopsys Design Analyzer, targeting the Artisan
standard cell library for TSMC 0.18µm technology. The entire chip was placed and routed,
including clock tree routing, with Cadence Silicon Ensemble. Physical verification, including
DRC, LVS, and antenna checking, was performed with Mentor Calibre. Back-annotated simulation to
verify correct operation and timing of the design was performed within the Cadence Verilog
environment.

A description of the external signals of the first prototype PIM chip is shown in Figure 8. There
are primarily two external interfaces: a host interface for implementing the JEDEC SDRAM
standard and the PiRC signals for inter-PIM communication. Additionally, there are signals for
configuring and monitoring the PLL (Phase Locked Loop) clock multiplier, testing the node
SRAMs, and reset and interrupt capabilities.

This prototype chip implements one PIM node (consisting of a 32-bit scalar processor, 256-bit
WideWord Unit, 4 Kbyte instruction cache, 8 Mbit node SRAM, and node parcel buffer), PIM
routing component (PiRC), and host interface (containing an external SDRAM interface and host
parcel buffer). The design was submitted on August 23, 2001 for fabrication on a TSMC 0.18µm
generic process offered through MOSIS. The intellectual property used in the chip design is from
three different vendors:

•  Artisan 

o  standard  cells  for  synthesized  logic 
o  pads 

o  32-word x 32-bit scalar register file
o  32-word x 256-bit WideWord register file (implemented as two x128 banks)
o  4 Kbyte SRAM for instruction cache core (implemented as two banks of 128-word x 128-bit SRAMs)
o  128-word x 20-bit SRAM for instruction cache tags

•  Virage  Logic 

o  8  Mbit  SRAM  (with  redundancy  to  allow  repair)  (implemented  as  two  banks  of 
32768  words  x  128  bits) 

o  fuse  boxes  for  the  configuration  of  the  SRAM 

•  NurLogic 

o  PLL  for  clock  multiplication  and  deskewing 

The  resulting  chip  is  9.8mm  on  a  side  and  contains  approximately  200,000  placeable  objects, 
where  a  placeable  object  is  anything  from  a  2-transistor  inverter  to  a  4  Mbit  SRAM  macro.  The 
chip  contains  approximately  55  million  transistors,  with  2  million  in  the  logic  and  smaller 
SRAMs  and  53  million  in  the  8  Mbits  of  node  SRAM.  The  chip  contains  352  pads:  240  signal 
I/O,  56  grounds,  28  pad  Vdd  (3.3V),  and  28  core  Vdd  (1.8V).
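
The storage sizes and pad counts quoted above can be cross-checked with a few lines of Python. The variable names are illustrative; the transistors-per-bit figure is simply derived from the two reported numbers and is not stated in the report.

```python
# Storage organizations as listed in the IP summary
scalar_rf_bytes = 32 * 32 // 8          # 32 words x 32 bits
wideword_rf_bytes = 32 * 256 // 8       # 32 words x 256 bits
icache_bytes = 2 * 128 * 128 // 8       # two banks of 128 words x 128 bits
node_sram_bits = 2 * 32768 * 128        # two banks of 32768 words x 128 bits

# Pad budget: signal I/O, grounds, pad Vdd, core Vdd
total_pads = 240 + 56 + 28 + 28

# Transistor budget (millions): logic and small SRAMs plus node SRAM
total_transistors_m = 2 + 53

# Derived, not reported: average node-SRAM transistors per stored bit
sram_transistors_per_bit = 53e6 / node_sram_bits

print(scalar_rf_bytes, wideword_rf_bytes, icache_bytes)
print(node_sram_bits // 2**20, "Mbit;", total_pads, "pads;", total_transistors_m, "M transistors")
print(round(sram_transistors_per_bit, 2), "transistors per bit")
```

The derived value of roughly 6.3 transistors per bit would be consistent with a six-transistor cell plus redundancy and peripheral circuitry, although the cell type is an assumption here.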


The  silicon  die  were  received  near  the  end  of  October  2001,  and  packaged  chips  were  received 
near  the  end  of  November  2001.  Photos  of  the  die  and  package  are  shown  in  Figure  9.  Due  to 
delays  in  procurement  of  test  fixtures,  full-scale  testing  did  not  commence  until  February  2002. 


Figure  9,  DIVA  PIM  prototype  chip 


The preliminary testing was conducted with the use of a custom-built PCB in an incremental
fashion. First, with all functional units in reset, we applied power and an input clock signal to
test the PLL clock multiplier, IP purchased from NurLogic. The PLL was functional over a wide
range of frequencies and voltages and under all possible configurations of input settings. This
verification proved that we had successfully integrated IP from a third-party vendor into our
design flow. We then proceeded to functional testing with the use of an HP 16702A logic analysis
system. Pattern generator modules were utilized to apply test vectors to the inputs of the chip,
and timing/state capture modules were used to sense the outputs of the chip. A photo of the lab
test setup is shown in Figure 10. The chip was tested for functionality at a testbench speed of
80 MHz.


Figure  10,  PIM  testbench  setup 


We first verified the operation of the memory access capability of the PIM chip by performing
writes/reads to the internal memory through the host memory interface of the PIM chip. After
verifying normal memory operation for the lowest 64 KB region of memory, we proceeded to
PIM processor checkout. The procedure consisted of downloading code through the host
memory interface, releasing the PIM processor from reset to execute the code, and then verifying
correct operation by reading back results through the host memory interface. After confirming
the validity of this debugging approach through a small arbitrary code example, we proceeded to
test the execution of the Cornerturn core loop, which had been coded to exploit novel features of
the DIVA PIM WideWord Unit. Reading the memory locations that contained the output matrix
and verifying that the input matrix had indeed been transposed confirmed successful execution of
the code. (The logic analyzer display showing the start of the transposed matrix is shown in
Figure 11.) We then began some speed testing to determine the clock frequency operating range
of the PIM chip. We were able to execute the Cornerturn application at 160 MHz while
dissipating only 800 mW. Even in this limited test setup, the chip achieved a peak 1.28 GOPS
(32-bit ops) and 5.12 GB/s memory bandwidth. After passing these initial tests, the chip was
released to the system integration team where many more results were achieved (refer to the
system integration section for details).
Figure 11, Display of read operation "cornerturn" output matrix
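
The peak figures reported for the 160 MHz test can be reproduced with simple arithmetic. The sketch below assumes one 256-bit WideWord operation and one 256-bit memory access per clock cycle, which is an inference from the numbers rather than a statement in the report; the energy-per-operation figure is likewise derived, not reported.

```python
CLOCK_HZ = 160_000_000      # tested clock frequency
WIDE_BITS = 256             # WideWord datapath width
OP_BITS = 32                # operand width used for the GOPS figure
POWER_MW = 800              # reported power dissipation

lanes = WIDE_BITS // OP_BITS                       # parallel 32-bit operations per cycle
peak_ops_per_s = CLOCK_HZ * lanes                  # operations per second
peak_bytes_per_s = CLOCK_HZ * (WIDE_BITS // 8)     # bytes per second
picojoules_per_op = POWER_MW * 1_000_000_000 // peak_ops_per_s

print(peak_ops_per_s / 1e9, "GOPS")
print(peak_bytes_per_s / 1e9, "GB/s")
print(picojoules_per_op, "pJ per 32-bit op at peak")
```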


4.3  Ongoing  and  Future  Work 

While finishing preparations for testing the first chip, we were also working on the designs of
the address translation unit and floating-point capability for the second turn of the chip. The
address translation unit was completed, integrated into the existing design, and validated
through simulation, including exception handling related to address faults, within a few months.
After performing some initial sizing estimates, we realized that we would not be able to fit 4
parallel double-precision floating-point units in our WideWord area budget, so we targeted 8
single-precision units. As technology continues to scale, future PIMs may revisit the possibility
of WideWord double-precision capability. Each single-precision unit implements the basic
floating-point functions: add, subtract, multiply, and divide. We used the MIT RAW design as a
guideline, but due to DIVA pipeline constraints were not able to use the RAW design as is. We
spent most of our time on the design of the divider and then on merging the subcomponents so
that they share resources that all of them need, such as operand formatting, rounding, and
normalization. We selected a divider design based on a Taylor series expansion approach developed
by Liddicoat at Stanford [Liddicoat02]. This design achieved a fairly high-performance divide
capability while minimizing silicon area. We synthesized the entire FPU (Floating Point Unit)
design, and the resulting post-synthesis area projections indicated an area of 0.32 mm² for each
single-precision floating-point unit, or a total of approximately 2.5 mm² for eight such units in
the WideWord datapath.

We re-architected the exception-handling unit to accommodate integration of the exceptions
from the WideWord floating-point units. Each of the eight single-precision floating-point units
of the WideWord datapath reports five types of exceptions: divide by zero, inexact, invalid,
overflow, and underflow. The only inconsistency with the IEEE-754 standard is the underflow
exception, which we use in place of supporting denormalized numbers and arithmetic. We have
combined the overflow and underflow status outputs into one value called precision status so that
the resulting 4 exception types of all 8 single-precision FPUs can be contained in one 32-bit
register. We have defined a new special-purpose register (SPR) in our architecture to capture this
information.
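
The packing of 8 units x 4 exception types into one 32-bit value can be sketched as follows. The bit layout (unit i in bits 4i to 4i+3, in the flag order shown) and all names are assumptions made for illustration; the report does not give the actual SPR layout.

```python
# Flag order within each 4-bit group is an assumption, not the documented layout.
FLAGS = ("divide_by_zero", "inexact", "invalid", "precision")

def pack_fp_status(per_unit):
    """per_unit: list of 8 dicts with the five raw flags reported by each FPU.
    Overflow and underflow are merged into a single 'precision' flag."""
    assert len(per_unit) == 8
    word = 0
    for unit, raw in enumerate(per_unit):
        merged = {
            "divide_by_zero": raw["divide_by_zero"],
            "inexact": raw["inexact"],
            "invalid": raw["invalid"],
            "precision": raw["overflow"] or raw["underflow"],
        }
        for bit, name in enumerate(FLAGS):
            if merged[name]:
                word |= 1 << (4 * unit + bit)
    return word

def no_exceptions():
    """Raw flag set for a unit that reported nothing."""
    return {"divide_by_zero": False, "inexact": False, "invalid": False,
            "overflow": False, "underflow": False}
```

Merging overflow and underflow is what reduces five flags to four per unit, so that eight units fit exactly in 32 bits.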

Work is now continuing under separate funding to implement the exception integration and
thereby complete the integration of floating-point capability into the DIVA design. Under the
HPCS-funded Godiva project, a DDR SDRAM interface is also being added to the rev 2 PIM
chip for its insertion into an Itanium2-based HP Long's Peak server.


5.  Compiler 

We have developed a compiler for the DIVA PIM processor that generates optimized code in the
DIVA ISA. As will be discussed in the context of system integration, the DIVA compiler
backend is based on the GNU GCC compiler, ported from the PowerPC toolset. GCC is a
commonly used optimizing compiler, but it targets conventional scalar instruction sets. To
support optimizations targeting the unique bandwidth-exploiting features of the DIVA ISA, we
developed front-end compiler technology that performs DIVA-specific optimizations, as
captured in Figure 12.


Figure 12, DIVA-specific compiler optimizations (labels recovered from the figure:
compiler-controlled caching, superword-level parallelism)

In Figure 12, the ovals represent the functional units of the DIVA PIM chip. As has been
previously described in the architecture discussion, there are both a 32-bit scalar functional
unit and a separate 256-bit wide functional unit. The shaded rectangles in the figure represent
on-chip storage. There is the DRAM array, which in today's technology could have up to
32 Mbytes, although in our prototype it is a 1 Mbyte SRAM array, as previously described. A
4 Kbyte I-cache holds the instruction stream, so that memory accesses are predominantly focused
on the program data. In addition, there are separate register files associated with each
functional unit: a 32-element, 32-bit scalar register file and a 32-element, 256-bit wide
register file.

The unshaded rectangles in the figure point to our compiler's targets of optimization. DIVA's
Wide functional unit has operations similar to a multimedia extension architecture such as the
PowerPC AltiVec, where the data type is larger than a machine word and can be configured to
perform SIMD parallel operations on different field widths: 8-bit, 16-bit and 32-bit. This type
of fine-grain parallelism is referred to as superword-level parallelism (SLP). Optimizations
targeting SLP are the first priority of our compiler.

The second priority relates to the WideWord register file, which is 1 Kbyte of storage very
close to the processor, and the fact that our architecture does not have a data cache. Our
target applications that can exploit the bandwidth of the WideWord datapath could also benefit
from the increased bandwidth and lower latency of a data cache, as compared to accessing the
DRAM array. For this same class of applications, however, compiler technology can also derive
the data access patterns and manage storage explicitly. For this purpose, we have developed new
optimizations in the DIVA compiler to support compiler-controlled caching in the WideWord
register file.

Further optimization benefits are obtained from exploiting spatial locality in the DRAM array.
When the application accesses memory, the latency of a memory access varies depending upon
whether the access is near the previous access. The DRAM first selects a page or row (assumed to
be 2048 bits) and then a 256-bit or 32-bit column within that row. Accesses to the same row as
the previous access are referred to as page-mode accesses, and have a 3x lower latency than
other accesses, which are said to be in random mode. Our compiler performs optimizations to
maximize the number of memory accesses that are in page mode.
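
A small Python model can make the page-mode versus random-mode distinction concrete. It assumes a single memory bank with one open row, byte addresses, and illustrative latencies of 4 and 12 cycles (chosen only to reflect the 3x ratio); the function names are hypothetical.

```python
ROW_BYTES = 2048 // 8   # one DRAM row (page) of 2048 bits

def classify(addresses):
    """Label each byte address 'page' if it falls in the same row as the
    previous access, otherwise 'random'. Assumes one bank, one open row."""
    open_row = None
    modes = []
    for addr in addresses:
        row = addr // ROW_BYTES
        modes.append("page" if row == open_row else "random")
        open_row = row
    return modes

def total_latency(addresses, page_cycles=4, random_cycles=12):
    """Sum the per-access latency under the assumed cycle counts."""
    return sum(page_cycles if mode == "page" else random_cycles
               for mode in classify(addresses))
```

Under this model, a run of 256-bit accesses that stays within one 256-byte row pays the random-mode latency once and the page-mode latency for every following access.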

Figure 13 illustrates the components of the DIVA compiler. The DIVA front-end compiler is
based on SUIF, a research compiler infrastructure developed at Stanford University. The
SUIF-based DIVA front end takes as input a C or Fortran program and generates optimized code in
MrC, a C-like language with extensions for superword-level parallelism developed for the
PowerPC AltiVec. The optimized MrC code is the input to the DIVA compiler backend, as
shown in Figure 13.

The DIVA compiler backend is based on a superword-extended AltiVec GCC backend available
from Motorola. The AltiVec GCC backend takes MrC code and generates AltiVec vector
instructions similar to DIVA WideWord instructions. To generate DIVA PIM code, we
integrated the DIVA GCC backend, which previously generated only DIVA scalar code, with the
AltiVec GCC backend. The final DIVA GCC backend generates code that uses both PIM scalar
and WideWord instructions.

Figure 13 shows both the DIVA GCC backend and the AltiVec GCC backend, since both take
optimized code from the SUIF-based front-end compiler. The AltiVec backend was a useful tool
for testing and tuning optimizations performed by the SUIF-based front-end compiler during the
time the DIVA PIM chip was not yet available for software experiments.

The remainder of this section describes the optimizations performed by our front-end compiler,
the implementation, and performance results.

Figure 13, DIVA PIM Compiler Technology

5.1  DIVA  PIM  front-end  compiler 

To  develop  a  DIVA  PIM  compiler  that  automatically  generates  optimized  code  targeting 
superword-level  parallelism,  we  have  collaborated  with  Saman  Amarasinghe  and  Samuel  Larsen 
at  MIT.  The  initial  MIT  SUIF-based  compiler  automatically  recognizes  SLP  and  generates 
optimized  code  targeting  the  PowerPC  AltiVec  multimedia  instructions.  The  DIVA  compiler  is 
built  upon  the  MIT-SLP  implementation  and  generates  code  targeting  DIVA’s  WideWord 
instructions. 

In  addition  to  superword-level  parallelism,  the  DIVA  SUIF-based  compiler  performs
optimizations  for  compiler-controlled  caching  in  the  wide  register  file.  We  developed  and 
implemented  new  analyses  for  identifying  temporal  and  spatial  reuse  of  data  in  loop  nest 
computations.  Our  compiler  performs  a  new  optimization  called  superword  replacement, 
whereby  accesses  to  superwords  in  memory  are  replaced  by  accesses  to  temporary  registers,  so 
that  the  DIVA  backend  register  allocator  tries  to  keep  these  temporaries  in  wide  registers.  This 
approach  adapts  related  techniques  for  exploiting  temporal  reuse  in  scalar  registers,  but  must  also 
account  for  parallelism  and  spatial  reuse. 
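
The effect of superword replacement can be illustrated with a Python sketch that counts superword loads for a simple two-point computation over adjacent superwords. This is not the compiler's algorithm: the class and function names are hypothetical, and a Python local variable stands in for a wide register holding the reused superword.

```python
class CountingMemory:
    """Memory of superwords (tuples of lanes) that counts every load."""
    def __init__(self, superwords):
        self.superwords = list(superwords)
        self.loads = 0

    def load(self, index):
        self.loads += 1
        return self.superwords[index]

def add_lanes(x, y):
    """Lane-wise addition of two superwords."""
    return tuple(a + b for a, b in zip(x, y))

def stencil_naive(mem, n):
    """out[j] = in[j] + in[j+1], loading both operands from memory each iteration."""
    return [add_lanes(mem.load(j), mem.load(j + 1)) for j in range(n)]

def stencil_replaced(mem, n):
    """Same computation, but in[j+1] is kept in a temporary (standing in for a
    wide register) and reused as in[j] in the next iteration."""
    out = []
    current = mem.load(0)
    for j in range(n):
        following = mem.load(j + 1)
        out.append(add_lanes(current, following))
        current = following
    return out
```

For n iterations the naive form issues 2n superword loads while the replaced form issues n + 1, which is the kind of reduction in WideWord memory accesses the optimization aims for.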

The DIVA SUIF-based front-end compiler automatically generates optimized MrC code for six
scientific/multimedia benchmarks: TOMCATV and SWIM from the SPEC'95 benchmark suite,
and the media kernels VMM (vector-matrix multiply), MMM (matrix-matrix multiply), FIR
(Finite Impulse Response Filter) and YUV (RGB to YUV conversion).

We  also  completed  an  implementation  and  experiment  in  our  DIVA  compiler  to  automatically 
reorder  memory  accesses  to  achieve  page-mode  memory  accesses,  rather  than  random-mode 
memory  accesses,  and  thus  greatly  reduce  memory  latency.  The  compiler  unrolls  inner  loops  and 
reorders  memory  accesses  when  there  are  no  data  dependencies  that  prevent  doing  so,  such  that 
accesses  within  the  same  page  are  performed  consecutively.  On  four  of  the  above  benchmarks, 
VMM,  MMM,  YUV  and  FIR,  we  observed  speedups  ranging  from  1.25  to  2.19X  on  the  DIVA 
simulator,  as  compared  to  not  performing  the  reordering  of  memory  accesses.  This  work  has 
been  reported  in  two  publications  [Chame00][Shin02b]. 
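
The benefit of unrolling and reordering can be shown with a small Python experiment on a synthetic access pattern. The sketch assumes two arrays that live in different DRAM rows, 32-byte (256-bit) elements, a single bank with one open row, and no data dependences between the accesses; all names and base addresses are hypothetical.

```python
ROW_BYTES = 256   # 2048-bit row
STEP = 32         # one 256-bit access

def page_mode_hits(addresses):
    """Count accesses that fall in the same row as the previous access."""
    hits, open_row = 0, None
    for addr in addresses:
        row = addr // ROW_BYTES
        hits += row == open_row
        open_row = row
    return hits

def original_order(n, base_a, base_b):
    """Loop body touches a[i] then b[i], alternating between two rows."""
    order = []
    for i in range(n):
        order += [base_a + i * STEP, base_b + i * STEP]
    return order

def unrolled_order(n, base_a, base_b, unroll):
    """Unroll by 'unroll' and group the accesses to each array together."""
    order = []
    for i in range(0, n, unroll):
        idx = range(i, min(i + unroll, n))
        order += [base_a + j * STEP for j in idx]
        order += [base_b + j * STEP for j in idx]
    return order
```

With n = 8 and an unroll factor of 4, the original order yields no page-mode accesses at all, while the reordered sequence touches exactly the same addresses with 12 of its 16 accesses in page mode.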

Under DIVA funding, we also began an evaluation of requirements to extend MIT-SLP so that it
can parallelize more programs of interest, such as the DIS Transitive Closure stressmark and
NAS CG. We have identified the need to extend MIT-SLP to support parallelization of
constructs containing conditionals for Transitive Closure, and to optimize movement of data
between scalar and wide register files, since movement between register files is not supported in
the AltiVec.

5.2  DIVA  PIM  backend  compiler 

As  the  AltiVec  GCC  backend  was  an  experimental  and  unsupported  system,  we  encountered  a 
number  of  challenges  in  merging  the  DIVA  GCC  backend  with  the  AltiVec  component. 
Determining  which  GCC  patches  to  integrate  and  which  to  omit  required  a  lot  of  information 
gathering  and  trial-and-error.  We  successfully  completed  the  integration,  and  began  porting  the 
AltiVec  GCC  backend  to  generate  DIVA  WideWord  code.  Under  DIVA  funding,  we 
implemented  a  subset  of  DIVA  WideWord  instructions  and  the  GCC  backend  generated 
WideWord  code  for  VMM,  a  kernel  that  performs  a  vector-matrix  multiply.  The  AltiVec  version 
of  the  compiler  has  generated  code  for  many  more  applications,  as  discussed  in  more  detail 
below. 

We have performed extensive experiments with the optimized code generated by our compiler,
for both DIVA and AltiVec. The experiments were performed both in an instruction simulator of
the DIVA ISA and on the PowerPC G4 (with an AltiVec). The optimizations for data reuse in
WideWord registers result in a reduction in scalar memory accesses of over 90% for the four
kernels and over 35% in SWIM and TOMCATV. In addition, we observe a reduction of
WideWord memory accesses of over 50% for three of the four kernels, and over 85% in SWIM
and TOMCATV. These reductions indicate that even more improvement can be expected on
DIVA, where there is no data cache. On the AltiVec, overall we are showing speedups ranging
from 1.7X to 12.3X over scalar execution, with an average of 4.2X. Speedups due to our
compiler optimizations for compiler-controlled caching range from 1.3 to 2.8, with an average of
2.2, over the MIT-SLP compiler upon which we base our implementation. This work has been
reported in three publications [Chame00][Shin02a][Shin03].

5.3 Additional Compiler Research

Beyond the node compiler implementation, we planned a long-term strategy for system-level
compilation (i.e., host and multiple PIMs) that is being pursued under separate funding. As was
discussed in the context of the DIVA system architecture, we designed DIVA such that it could
be programmed using conventional solutions from parallel computing, rather than requiring a
programming paradigm specific to DIVA or to PIMs. As a system-level programming strategy,
we have adopted Unified Parallel C (UPC), a relatively new parallel programming language.
UPC was developed as a unification of the best ideas among several research C compilers that
support a global address space, and allow high-level specification of data distribution in an
SPMD (Single Program Multiple Data) abstraction for high-end shared-memory,
distributed-shared-memory and even distributed-memory parallel systems. The development of the
UPC language and its implementations has been motivated by DoD interest and support. There are
several commercial UPC compilers, and there are a number of defense applications already
written in UPC. We chose UPC for all these reasons, as well as the fact that we can develop
DIVA target applications that are pointer-based in a C-based language, but cannot in other
parallel programming languages such as, for example, CoArray Fortran.

As part of future work, we are collaborating with Lawrence Berkeley Laboratories and UC
Berkeley to develop a UPC compiler for the DIVA prototype. They have an ongoing effort to
develop a portable UPC compiler.

6.  System  Simulator 

We  developed  a  simulator  of  the  DIVA  system  architecture  that  was  used  throughout  the 
duration  of  the  project  for  several  application  and  architectural  studies.  Among  these  studies 
were  the  investigation  of  performance  of  data-intensive  applications  on  DIVA,  the  analysis  of 
architectural  design  trade-offs  and  bottlenecks  and  studies  that  evaluated  and  provided  feedback 
to  the  design  of  the  DIVA  Instruction  Set  Architecture  (ISA). 

The  DIVA  system  simulator  (DSIM)  uses  RSIM  (http://rsim.cs.uiuc.edu/rsim)  as  a  framework, 
with  significant  extensions.  RSIM  is  an  event-driven  simulator  that  models  shared-memory 
multiprocessors  built  with  state-of-the-art  multiple-issue,  out-of-order  superscalar  processors. 
DSIM  extensions  include  a  simpler  PIM  processor  with  a  WideWord  unit,  the  DIVA  memory 
system,  the  parcel  communication  mechanism  and  the  PIM-to-PIM  interconnect.  DSIM  supports 
the  DIVA  PIM  ISA. 

The DSIM host processor is taken directly from RSIM, as well as the host first and second-level
caches. The host processor architecture is based on the MIPS R10000, which is configured as a
four-issue processor with two integer arithmetic units, two floating-point units and one address
unit. Loads are non-blocking. It has a 32 Kbyte L1 and a 1 Mbyte L2 cache, both two-way
associative, with access times of 1 and 10 cycles, respectively. Both L1 and L2 caches are
pipelined and support multiple outstanding requests to distinct cache lines.

The host is connected to the DIVA memory system via a split-transaction, 64-bit bus. The
memory system consists of the aggregation of all PIM memories, where each local memory is
visible from both host and local PIM processor. DSIM maintains the current open row of each
memory bank to determine the memory access type (page or random mode) and simulates
arbitration between host and PIM accesses. The memory latencies seen by the host are 52 cycles
for page-mode accesses and 60 cycles for random mode, and include the bus transfer delay, the
memory arbitration time and the DRAM access time (4 and 12 cycles for page and random mode,
respectively). The memory latencies seen by the local PIM processor, including arbitration and
DRAM access times, are 6 and 14 cycles for page- and random-mode accesses, respectively.
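
The open-row bookkeeping described above can be illustrated with a minimal sketch. It is not
DSIM code: the class name, method name and data layout are hypothetical, and arbitration is not
modelled separately because the quoted end-to-end latencies already include it. Only the four
latency values come from the text.

```python
# End-to-end latencies (in cycles) quoted in the text.
HOST_LATENCY = {"page": 52, "random": 60}
PIM_LATENCY = {"page": 6, "random": 14}


class BankModel:
    """Hypothetical open-row tracker; one entry per DRAM bank."""

    def __init__(self, num_banks):
        self.open_row = [None] * num_banks  # None means no row is open yet

    def access(self, bank, row, requester):
        """Return (mode, latency) for an access by 'host' or 'pim'."""
        # An access to the currently open row is a page-mode access;
        # anything else is a random-mode access that opens a new row.
        mode = "page" if self.open_row[bank] == row else "random"
        self.open_row[bank] = row
        table = HOST_LATENCY if requester == "host" else PIM_LATENCY
        return mode, table[mode]
```

In this sketch a second access to the same row of a bank is classified as page mode regardless
of which processor issued the first one, which is one plausible reading of "maintains the
current open row of each memory bank".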

DSIM  also  models  the  parcel  mechanism  and  the  PIM-to-PIM  interconnection  in  detail. 
Applications  executing  on  DSIM  have  direct  access  to  the  parcel  buffers  via  parcel  handling 
functions  that  perform  the  writing/reading  to/from  the  memory  mapped  parcel  buffers.  These 
parcel  handling  functions  are  part  of  DSIM's  application  library,  and  support  the  full  set  of  parcel 
buffer  status  reads,  triggering/non-triggering  writes  to  the  send  parcel  buffers  and 
destructive/nondestructive  reads  from  the  receive  parcel  buffers. 

The  application  library  also  supports  a  cache-line-flush  function  to  enforce  coherence  between 
the  host  caches  and  PIM  memory,  and  synchronization  functions.  The  functions  in  the 
application  library  are  linked  with  the  application  code,  and  their  execution  is  simulated  by 
DSIM  as  part  of  the  application. 

The  simulator  parameters  used  in  our  application  studies  were  based  on  the  conservative 
assumption  that  the  PIM  processor  runs  at  half  the  speed  of  the  host  processor.  Although  the 
inherent  speed  of  the  logic  is  no  slower,  we  make  this  assumption  because  the  WideWord 
register  accesses  could  impact  the  clock  speed. 

7.  Application  Studies 

We performed several application studies, using the DIS Stressmark Suite as well as other
data-intensive or high-performance-computing benchmarks, including NAS CG and the
template-matching (TM) component of the Sandia ATR benchmark. We first describe the DIVA
implementations of the DIS stressmarks, then we present experimental results on the stressmarks
and other benchmarks, and later we discuss our earlier application studies.

7.1  DIS  Stressmarks 

This section contains a description of our implementation of the Cornerturn, Pointer, Transitive
Closure and Neighborhood stressmarks. For each of these stressmarks, we describe how the
stressmark is mapped to DIVA, including computation and data partitioning, host-and-PIM and
PIM-to-PIM communication and synchronization. We also describe how the WideWord unit is
used, when applicable (Pointer and Neighborhood do not use the PIM WideWord unit).

Cornerturn

The DIVA implementation of Cornerturn performs a hierarchical matrix transpose, where the
matrix is partitioned into blocks and each block is assigned to a PIM node. The transpose of each
block is computed by partitioning the block into sub-blocks, which are then transposed in
WideWord registers using permutation operations. We present below a simplified
implementation, which is valid for square matrices only.

The host performs the initial block partitioning, keeping a table with the assignment of blocks to
PIMs, and coordinates synchronization between host and PIMs. In the first phase of the
computation, each PIM computes the transpose of its local block. After that, each pair of PIMs
owning blocks that need to be swapped to form the transposed matrix communicate using the
PIM-to-PIM network.

The local block transpose is performed as a set of transposes of 8x8 sub-blocks (except for block
sizes that are not a multiple of the number of matrix elements that fit in a WideWord register). For
the out-of-place transpose, each 8x8 sub-block is loaded into the WideWord register file (an 8x8
matrix with 32-bit elements requiring 8 WideWord registers), and transposed via a sequence of
permutation operations. The transposed sub-block is then stored back in memory at the target
location. In the in-place transpose (of square blocks), two sub-blocks of size 8x8 are loaded in
WideWord registers, each sub-block is transposed in registers, and then the transposed
sub-blocks are stored back in memory, swapping locations to form the transposed block. This
implementation takes advantage of the large capacity of the WideWord register file, avoiding
loads and stores to memory during the transpose of each 8x8 sub-block.
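
The in-place scheme can be sketched as follows. This is an illustrative model only: a
WideWord register is represented as a Python list of eight values, and the register-level
transpose is done with zip as a stand-in for the sequence of permutation operations, whose exact
instructions are not given here. All function names are hypothetical.

```python
SUB = 8  # eight 32-bit elements fit in one 256-bit WideWord register


def load_subblock(m, r0, c0):
    """Load an 8x8 sub-block as eight 'registers' (one per row)."""
    return [m[r0 + r][c0:c0 + SUB] for r in range(SUB)]


def transpose_in_registers(regs):
    """Stand-in for the permutation sequence that transposes 8 registers."""
    return [list(col) for col in zip(*regs)]


def store_subblock(m, r0, c0, regs):
    """Store eight 'registers' back as an 8x8 sub-block."""
    for r in range(SUB):
        m[r0 + r][c0:c0 + SUB] = regs[r]


def transpose_block_in_place(m):
    """Transpose a square block whose side is a multiple of 8."""
    n = len(m)
    assert n % SUB == 0 and all(len(row) == n for row in m)
    for r0 in range(0, n, SUB):
        for c0 in range(r0, n, SUB):
            a = transpose_in_registers(load_subblock(m, r0, c0))
            if r0 == c0:
                # Diagonal sub-block: transposed copy returns to the same place.
                store_subblock(m, r0, c0, a)
            else:
                # Off-diagonal pair: both are held in registers at once,
                # transposed, and stored back with their locations swapped.
                b = transpose_in_registers(load_subblock(m, c0, r0))
                store_subblock(m, c0, r0, a)
                store_subblock(m, r0, c0, b)
```

Holding both sub-blocks of a pair in registers at once needs 16 of the WideWord registers in
this model, which is why the register file capacity matters for the in-place variant.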

After computing its local transposed block, each PIM exchanges its transposed block with the
PIM that owns the location of the block in the transposed matrix. For example, for a square
matrix divided into four blocks where block-00 is assigned to PIM-0, block-01 to PIM-1,
block-10 to PIM-2 and block-11 to PIM-3, PIM-1 exchanges its transposed block with PIM-2. PIM-0
and PIM-3 keep their transposed blocks since they should remain in the same location in the
transposed matrix.

The communication phase is performed in two steps: in the first step, PIMs owning blocks in the
upper triangular sub-matrix send their blocks to PIMs owning blocks in the lower triangular
sub-matrix; the second step completes the exchange of blocks, with PIMs in the lower triangular
sub-matrix sending blocks to PIMs in the upper triangular sub-matrix.
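
The exchange pattern and its two steps can be written down as a small sketch. The table
format (a dictionary from block coordinates to PIM number) and the function name are
assumptions made for illustration; the host's actual table layout is not described here.

```python
def exchange_plan(owner):
    """Build the two-step exchange schedule.

    owner maps (block_row, block_col) -> PIM id (hypothetical table format).
    Returns two lists of (sender, receiver) pairs, one per step.
    """
    step1, step2 = [], []
    for (i, j), pim in sorted(owner.items()):
        if i == j:
            continue  # diagonal blocks stay on their PIM
        # Block (i, j) belongs at position (j, i) of the transposed matrix.
        partner = owner[(j, i)]
        if i < j:
            step1.append((pim, partner))  # upper-triangular blocks send first
        else:
            step2.append((pim, partner))  # lower-triangular blocks send second
    return step1, step2
```

For the four-block example in the text, `exchange_plan({(0, 0): 0, (0, 1): 1, (1, 0): 2, (1, 1): 3})`
yields PIM-1 sending to PIM-2 in the first step and PIM-2 sending to PIM-1 in the second.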

Finally, this implementation of Cornerturn avoids contention on the PIM-to-PIM network by
assigning each pair of blocks that will exchange locations in the transposed matrix to neighboring
PIMs. This assignment is based on the fact that communication occurs between fixed pairs of
PIMs, and that when assigning a block to a PIM it is possible to determine the location of its
transposed block in the transposed matrix, and then assign the block corresponding to this
location to the nearest PIM available.

Our HOST version of Cornerturn shows high memory stall times for input sizes that do not fit in
the host L2 cache. This application has very little temporal reuse, since each matrix element is
accessed only a few times during each matrix transpose. Thus primarily spatial reuse is
exploited in cache, and each new cache line is only reused a few times. In the PIM version, the
WideWord datapaths also exploit the available spatial reuse. Furthermore, the WideWord
loads/stores and operations on eight matrix elements at a time also reduce the number of accesses
to memory. Finally, the latency seen by the PIM processor is lower than that suffered by the host
for large input sizes. For example, a 1024x1024 matrix is four times larger than the host L2
cache, resulting in memory stall times corresponding to 98% of the host execution time. On the
other hand, the 1-PIM version spends 40% of the execution time stalled for memory, due to the
lower on-chip latencies and a reduction in the number of memory accesses (the average latency
seen by the PIM is 11.6 cycles, since most of the accesses are in random mode).

Transitive Closure

The  implementation  of  Transitive  Closure  for  DIVA  is  based  on  the  DIS  sample  code,  and  uses  a 
dense  matrix  to  represent  the  distance  graph.  It  exploits  both  fine-grain  parallelism,  by 
performing  WideWord  arithmetic  operations  on  eight  32-bit  elements  of  the  matrix  in  parallel, 
and  coarse-grain  parallelism,  by  partitioning  the  data  and  computation  among  PIM  nodes. 

The  host  processor  computes  the  matrix  partition  and  coordinates  synchronization.  Matrices  din 
and  dout  are  partitioned  by  rows  and  a  set  of  consecutive  rows  is  assigned  to  each  PIM  node.  For 
the  main  loop  nest  of  Transitive  Closure,  for  each  iteration  of  the  outer  loop  k,  each  PIM  node 
performs  the  inner-loop  computation  (loops  i  and  j)  on  its  local  set  of  rows,  using  a  copy  of  row  k 
previously  sent  by  the  PIM  that  owns  row  k.  Therefore,  for  each  iteration  of  loop  k,  the  PIM  node 
that  owns  row  k  sends  a  copy  of  this  row  to  all  other  PIMs.  All  PIM  nodes  synchronize  on  each 
iteration  of  loop  k,  after  the  communication  phase. 
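
The partitioning and per-iteration communication can be sketched as below. This is a
sequential model, not the DIVA code: the PIM nodes are simulated one after another, the
multicast is represented by taking a copy of row k, and a single matrix is updated in place
rather than using separate din and dout matrices. Function names are hypothetical.

```python
def partition_rows(n, num_pims):
    """Assign a set of consecutive rows to each PIM node."""
    base, extra = divmod(n, num_pims)
    parts, start = [], 0
    for p in range(num_pims):
        size = base + (1 if p < extra else 0)
        parts.append(range(start, start + size))
        start += size
    return parts


def transitive_closure(dist, num_pims):
    """Floyd's algorithm with rows partitioned among simulated PIM nodes."""
    n = len(dist)
    parts = partition_rows(n, num_pims)
    d = [row[:] for row in dist]
    for k in range(n):
        # Communication phase: the owner of row k sends a copy to all PIMs.
        row_k = d[k][:]
        # (All PIMs synchronize here before computing.)
        for rows in parts:  # each PIM updates only its local rows
            for i in rows:
                dik = d[i][k]
                for j in range(n):
                    cand = dik + row_k[j]
                    if cand < d[i][j]:
                        d[i][j] = cand
    return d
```

The result does not depend on the number of simulated PIM nodes, since each node reads only its
own rows and the copy of row k.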

The  multicast  of  a  matrix  row  from  one  PIM  to  all  other  PIMs  is  performed  using  the  multicast 
mode  supported  by  the  DIVA  parcel  buffer  mechanism.  The  sender  processor  writes  a  parcel 
payload  to  the  parcel  buffer,  and  then  writes  a  parcel  header  for  each  destination  PIM.  The  write 
to  the  parcel  header  triggers  the  sending  of  the  parcel  to  the  specified  destination.  This  multicast 
mode  allows  the  sender  processor  to  write  the  parcel  payload  only  once,  reducing  the  cost  of 
assembling  parcels  in  the  parcel  buffer. 

The  local  computation  on  each  PIM  node  takes  advantage  of  the  WideWord  unit  in  the 
computation  of  the  minimum  value  of  each  pair  of  elements  from  two  matrix  rows.  Selective 
execution  using  a  WideWord  operation  (wmrgcc)  merges  the  contents  of  two  WideWord 
registers  according  to  condition-code  bits,  allowing  an  efficient  computation  of  the  minimum 
value  of  each  pair  of  elements  of  two  WideWord  operands. 
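
The element-wise minimum via a compare followed by a condition-code merge can be modelled
as follows. This is a loose behavioural sketch: the operand order, condition-code encoding and
exact semantics of wmrgcc are not specified here, so the helper functions below are
hypothetical stand-ins that only capture the idea of a per-lane select.

```python
LANES = 8  # eight 32-bit elements per WideWord operand


def wide_compare_lt(a, b):
    """Per-lane condition codes: True where a[i] < b[i]."""
    return [x < y for x, y in zip(a, b)]


def wide_merge(cc, a, b):
    """Per-lane select: take a[i] where cc[i] is set, otherwise b[i]."""
    return [x if c else y for c, x, y in zip(cc, a, b)]


def wide_min(a, b):
    """Minimum of each pair of elements of two WideWord operands."""
    assert len(a) == LANES and len(b) == LANES
    return wide_merge(wide_compare_lt(a, b), a, b)
```

The point of the merge is that no per-element branch is needed: one compare and one merge
handle all eight lanes.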

Finally,  for  both  the  HOST  and  PIM  versions,  the  inner  loops  (loops  i  and  j)  of  the  main  loop 
nest  were  interchanged,  so  that  the  HOST  can  benefit  from  spatial  locality  at  the  caches,  and 
PIMs  can  exploit  spatial  reuse  in  WideWord  registers. 

Our PIM implementation benefits from fine-grain and coarse-grain parallelism, and also from the
higher bandwidths available on chip. For example, the HOST version for input tc05.in spends
65.2% of its execution time stalled due to cache misses, with 11.3% of the misses satisfied at the
L1 and 58.4% satisfied at the L2, resulting in an average memory latency of 6.7 cycles. The
1-PIM version shows a higher average memory latency (9.5 cycles), but it issues fewer memory
accesses, since the WideWord unit is used to transfer data to/from memory and perform the
computation. Therefore the 1-PIM memory stall time is smaller than that of the HOST version.
The use of the WideWord unit also results in exploiting spatial reuse, since the matrix is
accessed with stride one in the row dimension.

Pointer 

Our implementation of Pointer is based on the sample code provided by Atlantic Aerospace. We
mapped Pointer to DIVA by partitioning both threads and the field array among PIM nodes. To
reduce communication costs, PIM nodes are partitioned into groups so that each group has a
copy of the array; the size of each group is the minimum number of PIM nodes required to keep
one copy of the array. For example, for a 4 MByte array and 16 PIM nodes, and assuming that
each PIM node can keep 2 MBytes of data, the PIMs would be partitioned into 8 groups of 2
PIMs, each group keeping a copy of the array.
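
The grouping rule amounts to a ceiling division, as the following sketch shows. The function
name and the error behaviour when the array does not fit are assumptions for illustration.

```python
def pointer_groups(array_bytes, num_pims, bytes_per_pim):
    """Return (number_of_groups, pims_per_group) for replicating the array."""
    # Minimum number of PIM nodes needed to hold one copy (ceiling division).
    group_size = -(-array_bytes // bytes_per_pim)
    if group_size > num_pims:
        raise ValueError("array does not fit in the available PIM memory")
    return num_pims // group_size, group_size


MB = 2 ** 20
# The example from the text: 4 MByte array, 16 PIMs, 2 MBytes per PIM.
groups, size = pointer_groups(4 * MB, 16, 2 * MB)
```

With the values from the text this gives 8 groups of 2 PIMs.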

Each  PIM  node  is  initially  assigned  a  set  of  threads.  Each  PIM  node  starts  a  thread  (from  its  own 
set)  and  proceeds  as  follows: 

1. When a "hop" is to a location mapped to the PIM, the PIM computes the median and next hop
as in the original sample code.

2. When a "hop" is to a location mapped to a remote PIM node, the PIM sends the "hop" (in a
parcel) to the remote node, which will then continue hopping on this thread.

3.  After  sending  a  remote  hop  out,  the  PIM  checks  if  it  has  received  any  parcels  containing 
"hops"  to  be  executed  locally.  If  there  is  a  parcel,  it  goes  to  step  1. 

4.  When  a  thread  is  completed,  the  PIM  node  that  executed  the  last  hop  marks  the  thread 
"done"  and  sends  a  parcel  to  the  PIM  that  owns  that  thread  signaling  that  the  thread  is 
done. 

Finally, the host processor checks for threads that are done and signals the PIMs when all threads
are done.
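
The hop-forwarding protocol can be illustrated with a heavily simplified simulation. Several
things here are assumptions and not part of the stressmark: the next location is read directly
from the array (the real code computes a median over a window), a thread ends after a fixed
number of hops, threads are seeded at the PIM that owns their starting location, the array is
split into equal contiguous chunks, and completion notices to the owning PIM and the host are
omitted. Parcel queues are modelled as Python deques.

```python
from collections import deque


def run_pointer(field, num_pims, starts, max_hops):
    """Simulate hop forwarding; returns (final index per thread, parcels sent)."""
    n = len(field)
    chunk = -(-n // num_pims)  # contiguous chunk of the array per PIM

    def owner(idx):
        return idx // chunk

    inbox = [deque() for _ in range(num_pims)]
    for tid, idx in enumerate(starts):
        inbox[owner(idx)].append((tid, idx, 0))

    results, parcels = {}, 0
    while any(inbox):
        for pim in range(num_pims):
            if not inbox[pim]:
                continue
            tid, idx, hops = inbox[pim].popleft()
            # Step 1: keep hopping while the location is local.
            while hops < max_hops and owner(idx) == pim:
                idx = field[idx] % n
                hops += 1
            if hops == max_hops:
                results[tid] = idx  # step 4: thread is done
            else:
                # Step 2: forward the hop to the remote owner in a parcel.
                inbox[owner(idx)].append((tid, idx, hops))
                parcels += 1
    return results, parcels
```

Counting parcels against hops in such a model makes the scaling problem discussed below
visible: when most hops leave the local chunk, nearly every hop costs one parcel.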

In our experiments, the HOST version performs better than the 1-PIM version when the input
size fits in the host L1 or L2 caches (as in p05.in and p20.in). The PIM version performs better
than the host version when the input data set fits in one PIM node and does not fit in the host
cache (such data is not reported since none of the DIS input sizes satisfies this condition). Our
PIM version of Pointer does not speed up when the array must be partitioned among PIMs. The
main reason our Pointer does not scale well is that the number of hops performed between
communications is very small: the local computation (an average of a couple of hops) is not
enough to amortize the cost of PIM-to-PIM communication.

Neighborhood 

The  Neighborhood  implementation  on  DIVA  exploits  coarse-grain  parallelism  by  partitioning 
the  computation  among  PIM  nodes.  Each  PIM  computes  a  partial  histogram  locally,  and  at  the 
end  of  the  computation  phase,  the  PIM  nodes  perform  a  parallel  reduction  to  compute  the  final 
histogram. The parallel reduction takes n-1 steps, where n is the number of PIM nodes. The
communication is scheduled to take advantage of the PIM-to-PIM interconnection topology
(bidirectional ring), avoiding contention in the network.
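
One schedule that completes in n-1 steps is a simple chain along the ring, sketched below. The
text does not specify the actual schedule or how the two ring directions are used, so this is
an assumed example: at each step one node forwards its accumulated histogram to its neighbor, so
only one link is in use at a time.

```python
def ring_reduce(partials):
    """Sum per-PIM partial histograms along a ring.

    partials[p] is the partial histogram of PIM p. Returns the final
    histogram (held by the last node) and the number of steps taken.
    """
    n = len(partials)
    acc = [h[:] for h in partials]
    steps = 0
    for s in range(n - 1):
        # Node s sends its accumulated histogram to neighbor s + 1,
        # which adds it to its own partial histogram.
        acc[s + 1] = [a + b for a, b in zip(acc[s], acc[s + 1])]
        steps += 1
    return acc[n - 1], steps
```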

The 1-PIM version of Neighborhood performs worse than the host version when the image fits in
the host L2 cache, for several reasons: the memory latencies seen by the PIM are larger than the
L2 access time; the PIM nodes operate at half the speed of the host; and our implementation of
Neighborhood does not take advantage of the WideWord unit. When coarse-grain parallelism is
exploited by partitioning the computation among several PIM nodes, the PIM version speeds up
considerably with respect to the host.

7.2  Experimental  evaluation 

1-PIM  performance 

To measure the performance potential of the DIVA architecture, we examine in detail eight
benchmark applications, summarized in Table 1.

Table 1. Summary of the eight benchmark applications

Program                  Description                       Source                   Data Set Size                        WideWord Usage
-----------------------  --------------------------------  -----------------------  -----------------------------------  -----------------------------------------------------
Template Matching (TM)   image correlation                 Sandia                   4-Kbyte image, 32 1-Kbyte templates  parallelism, selective, reuse in registers, page mode
Cornerturn (CT)          matrix transpose                  Atlantic Aerospace       32-Mbyte matrix                      parallelism, permutation
CG                       sparse conjugate gradient         NAS                      2M double-precision elements         parallelism, floating-point, page mode
Transitive Closure (TC)  Floyd's all-pairs shortest paths  Atlantic Aerospace       256 Kbytes                           parallelism, selective, reuse in registers
Neighborhood (NH)        image processing stencil          Atlantic Aerospace       500,000 bytes                        (not used)
Natural Join (NJ)        relational database join          Alphatech                72 Kbytes                            (not used)
Pointer (P)              random walk                       Atlantic Aerospace       4 Mbytes                             (not used)
OO7                      object-oriented database query    University of Wisconsin  888 Kbytes                           (not used)

These applications span a broad range of domains including scientific computing, databases and
image processing. They exhibit both coarse-grain parallelism (which allows computation to be
spread across PIMs) and, in some cases, fine-grain parallelism (which can be exploited through
execution in the WideWord unit). CG, Neighborhood, Pointer, OO7 and Natural Join exhibit
irregular or mixed (regular and irregular) data access patterns, resulting in high memory access
overheads on conventional architectures. Cornerturn, Transitive Closure and Template Matching
are dense matrix computations with regular access patterns, although memory bandwidth
becomes a limiting factor in exploiting the significant available parallelism. These three and CG
rely on the WideWord unit to exploit parallelism and PIM bandwidths. Hereon, we use
abbreviations for each of the program names, with a suffix -H for host and -P for PIM.

The graph in Figure 14 summarizes 1-PIM performance as compared to execution on the
conventional host processor. Five of the eight programs speed up significantly compared against
host execution, two remain about the same, and one program is slowed down. (All programs
speed up when multiple PIMs are used.) Overall, the average speedup is 3.39X.


Figure  14,  Summary  of  1-PIM  performance  relative  to  host 

Several  factors  contribute  to  these  speedups,  including  the  lower  memory  stall  times  on  the  PIM 
nodes  and  the  benefits  of  the  WideWord  unit  in  exploiting  fine-grain  parallelism  and  taking 
advantage  of  page-mode  memory.  These  factors  are  discussed  in  detail  in  the  subsections  that 
follow. 


Reduction  in  Memory  Stall  Time 

To illustrate the impact of memory latencies on the applications' total execution times, Figure 15
shows the busy and memory stall components of host-only execution. We see from the figure
that five of the eight programs spend more than 40% of their time stalled in memory accesses.


Figure 15, Host-only busy and memory stall times for the eight programs

PIMs reduce memory stall time in two ways: (1) lower latency to memory; and, (2) higher
bandwidth to memory through wide loads and stores. (A third reduction occurs as a result of
coarse-grain parallelism across the PIMs.) DIVA achieves a reduction in memory stall time for
these five programs ranging from 13.89% for Natural Join to 95% for Cornerturn, as shown in
Figure 16.


Figure 16, Memory stall times of host-only and 1-PIM execution

The host version of Template Matching (TM-H) has a memory stall time of only 3% of its total
execution time. The reason is that the data set size fits in the L2 host cache and the working set
of each loop fits in the L1 cache, and therefore the data reuse exhibited by TM is effectively
exploited. Even though TM-H does not suffer from large memory stall times, the 1-PIM version
(TM-P) has even smaller stall times due to the high data bandwidth at the PIM node. The use of
the WideWord unit for loading/storing and operating on 256-bit objects, plus the reuse of data in
WideWord registers, reduces the memory stall time to 20% of that of TM-H.

Cornerturn has a memory stall time of 90.17% when running on the host. This application has
very little temporal reuse, since each matrix element is accessed only twice (one read and one
write) during the matrix transpose. Thus primarily spatial reuse is exploited in cache, and each
new cache line is only reused a few times (1 load and 1 store per element, and 8 elements per
cache line) once loaded, and then never used again. In the PIM version, the WideWord datapaths
also exploit the available spatial reuse. Furthermore, the WideWord loads/stores and operations
on 8 matrix elements at a time also reduce the number of accesses to memory.

Finally,  the  latency  seen  by  the  PIM  processor  (average  of  11.57  cycles,  since  most  of  the 
accesses  are  in  random  mode)  is  much  lower  than  that  suffered  by  the  host.  The  combination  of 
these  factors  reduces  the  CT-P  memory  stall  time  to  4.32%  of  that  of  CT-H. 

CG also benefits from the lower memory latencies on the PIM node. Since the data set size does
not fit in the host caches and the irregular access patterns cause conflict misses, CG-H spends
85.21% of its execution time stalled due to cache misses. Although most of the misses are
satisfied at the L2 cache (51.32%), 46% of the stall time is due to accesses to the DRAM. On the
PIM, 78% of the memory accesses are page-mode accesses, and the average latency seen by the
processor is only 5.91 cycles.

TC-P benefits from both fine-grain parallelism and the higher bandwidths available on chip.
TC-H spends 70% of its execution time stalled due to cache misses, with 47.14% of the misses
satisfied at the L1 and 52.81% satisfied at the L2, resulting in an average miss latency of 6.23
cycles. On the PIM version, the average memory latency is 5.57 cycles, because 67% of the
accesses are page-mode accesses. In addition to lower memory latencies, TC-P also has a smaller
number of memory accesses, since the WideWord unit is used to transfer the data to/from memory
and perform the computation. Therefore the memory stall time of TC-P is smaller than that of the
host version. The use of the WideWord unit also results in the added benefit of exploiting spatial
reuse, since the matrix is accessed with stride one in the row dimension.

Neighborhood shows an increase in memory stall time because the data fits in the host cache, and
thus the memory latency at the PIM is larger than that of the host. This increase in memory stall
time, together with the fact that the PIM processor runs at half the speed of the host, results in a
slowdown with respect to host-only execution.

Pointer  has  no  spatial  reuse  and  little  temporal  reuse,  and  since  the  data  set  size  is  larger  than  the 
L2  cache,  P-H  stalls  for  memory  for  49.8%  of  its  execution  time,  with  most  misses  satisfied  at 
the  DRAM.  P-P  has  roughly  the  same  number  of  loads  and  stores,  but  the  average  latency  seen 
by  the  PIM  is  much  smaller  than  the  memory  latency  suffered  by  the  host,  even  though  most  of 
the  PIM  accesses  are  random-mode  accesses. 

Natural Join has little temporal reuse and high cache miss rates, even though the data set size fits
in the L2 cache. NJ-P shows a reduction of 13.8% in memory stall times due to the lower
average latency seen by the PIM processor. OO7 also has almost no temporal reuse and OO7-H
suffers from a large number of cache misses. On the PIM version the memory stall time is
reduced by 62.8%, again as a result of the smaller on-chip latency.

Benefits from WideWord Unit and Page Mode Memory Accesses

To isolate the benefits of the WideWord unit, we compare scalar versions against versions tuned
to take advantage of the WideWord unit and page-mode memory accesses for the four programs
that utilize the wide datapaths. These results are shown in Figure 17. Speedups are significant,
ranging from 1.19X for CG up to 17.96X for TM, with an average improvement of 9.93X.

Figure 17, Benefits of WideWord instructions and page-mode memory accesses (speedup of
1-PIM with superword-level parallelism over 1-PIM scalar)

CG's  key  computation  is  a  sparse  matrix-vector  multiply.  Due  to  the  mixed  regular/irregular 
nature  of  data  accesses,  we  only  exploit  fine-grain  parallelism  in  the  WideWord  unit  for  the 
regular  portions  of  the  computation.  The  dense  vector  accesses  are  loaded  into  WideWord 
registers,  and  the  dense  vector  multiplies  are  performed  in  the  WideWord  floating-point  unit. 
The  accumulates  into  the  sparse  matrix  are  performed  sequentially.  Selective  execution  is  used  to 
select  the  field  of  the  WideWord  operand  that  participates  in  the  operation.  Further  performance 
improvements  are  obtained  by  reordering  memory  accesses,  grouping  streaming  accesses  to  the 
dense  arrays  to  achieve  page  mode  memory  access  latencies. 

The  CT  implementation  performs  a  hierarchical  in-place  matrix  transpose  where  the  smallest 
submatrices,  of  size  8x8,  are  transposed  in  WideWord  registers.  Each  8x8  submatrix  is  loaded 
into  the  WideWord  register  file  (an  8x8  matrix  with  32-bit  elements  requiring  8  WideWord 
registers),  and  transposed  via  a  sequence  of  permutation  operations.  The  transposed  submatrix  is 
then stored back in memory. This implementation takes advantage of the large capacity of the
WideWord register file, avoiding loads and stores to memory during the transpose of each 8x8
submatrix.

TM computes three correlation values between an image and each of 32 templates, each
correlation corresponding to a loop nest. The DIVA implementation, which is described in detail
in [chame00], takes advantage of the inherent fine-grain parallelism by operating on 32 8-bit
image pixels and 32 8-bit template elements at a time. Since a template is represented as a
32-by-32 matrix of 8-bit elements, an entire template row fits into one WideWord register. Also,
since the innermost loop of each loop nest traverses one template row, the entire inner loop
computation is transformed into a sequence of WideWord operations on one template row and 32
pixels of an image row, therefore eliminating the innermost loop. The accumulation of the pixel
values is achieved by a parallel reduction sum, and the result of the reduction sum is added to the
correlation value using selective execution. To exploit temporal reuse in WideWord registers, we
applied common loop transformations, particularly unroll-and-jam. In addition, we exploited
spatial reuse by shifting an image subrow held in a WideWord register by one pixel, to move the
window of the image to be compared against the template. As in CG, we also reordered memory
accesses to achieve page-mode latencies.

TC uses a dense matrix to represent the distance graph. It exploits fine-grain parallelism by
performing WideWord arithmetic operations on eight 32-bit elements of the matrix that are held
in WideWord registers. Selective execution using a WideWord operation (wmrgcc) merges the
contents of two WideWord registers according to condition-code bits, allowing an efficient
computation of the minimum value of each pair of elements of two WideWord operands. Similar
to TM, we use unroll-and-jam to obtain temporal reuse in the WideWord register file.

Overall  Speedups 

In Figure 18, we present speedups for four benchmarks, using the DIVA system over executing
the applications on the host processor. Our experiments show significant improvements over the
host-only execution for the three DIS stressmarks (Transitive Closure, Cornerturn and
Neighborhood) and NAS CG, with speedups ranging from 19.4X to 39.5X on a 64-node system.
These high speedups are in spite of the fact that the PIM processors are running at half the speed
of the host, and are in-order, single-issue, vs. out-of-order, 4-issue for the host.

Our CG implementation performs a parallel reduction to accumulate partial results computed
locally by each PIM processor. During this parallel reduction phase, a PIM node sends its local
copy of the result array to another PIM node. This transfer of a large amount of data to a single
destination processor is well suited for the streaming mode supported by our parcel mechanism.
In Transitive Closure, there is a communication phase on each iteration of the outermost loop of
a 3-deep loop nest. During this phase, one PIM processor sends its local copy of a matrix row to
all other processors executing the parallel application. This communication pattern can take
advantage of the multicast mechanism supported in DIVA. Similarly, Neighborhood exhibits
communication patterns that can take advantage of the streaming parcel mode.


Figure 18, Speedup on four benchmarks as a function of the number of PIMs

7.3 Earlier Application Studies

At the initial phase of the project, we derived a set of benchmarks that could be used for
evaluation purposes throughout the project. This initial set consisted of six benchmarks selected
from well-known scientific benchmark suites (NAS, Splash-2), pointer-based and database
benchmarks (Sparse from McGill and OO7 from University of Wisconsin), as well as the
template-matching component of Sandia's ATR application, and the Munkres benchmark
provided by Alphatech.

To  evaluate  the  design  of  the  DIVA  ISA,  we  performed  experiments  using  this  set  of 
benchmarks.  One  of  the  goals  of  the  experiments  was  to  identify  useful  permutation  patterns  for 
rearranging  data  in  the  PIM  wide  registers,  using  the  wide  unit  permutation  network.  The  DIVA 
PIM  ISA  supports  efficient  permutation  operations  for  a  set  of  frequently  used  permutation 
patterns;  this  application  study  identified  frequently  used  permutation  patterns,  such  as  data 
shifting,  reductions,  sorting,  gather  and  scatter,  which  were  integrated  into  the  DIVA  PIM  ISA. 

In another experiment, we performed simulations on the template-matching component of
Sandia's ATR to evaluate the benefits and trade-offs of the WideWord datapaths. Using
WideWord operations for exploiting fine-grain parallelism and data reuse in the WideWord
registers, we obtained a 13x reduction in the number of dynamic instructions and a 300x
reduction in the number of dynamic memory accesses. These improvements led to an overall
speedup of 38.3 on a system with 32 PIMs.

We demonstrated a speedup of 20.6x on the NAS CG benchmark, over execution on a high-end
workstation based on the MIPS R10000. Several architecture features of DIVA contributed to
these speedups: the lower memory latencies on PIM chips, the PIMs' wide datapaths for parallel
memory operations and efficient communication, and a WideWord floating-point unit that allows
four double-precision floating-point operations to be performed in parallel. For these experiments, we
modeled in the simulator a WideWord floating-point unit capable of performing four double-
precision floating-point operations (our second DIVA chip supports eight single-precision
floating-point operations performed in parallel).

We  performed  an  initial  mapping  of  three  of  the  DIS  benchmarks  (Image  Understanding,  Ray 
Tracing  and  Method  of  Moments)  to  the  DIVA  architecture,  including  data  and  computation 
partitioning  between  host  and  PIM  processors,  parallelization  (coarse-  or  fine-grain),  and  data 
locality  optimizations.  We  did  not  complete  our  studies  of  the  DIS  benchmarks,  since  soon  after 
performing  the  mappings,  the  DIS  stressmarks  were  introduced  and  became  the  benchmark  suite 
used  by  all  the  DIS  projects.  We  subsequently  concentrated  our  resources  on  experimenting  with 
the  DIS  stressmarks.  As  a  result,  we  did  not  produce  performance  results  for  the  benchmarks. 
Nevertheless,  for  archival  purposes,  we  include  the  most  interesting  aspects  of  the  mappings  here. 
We spent the most time on Image Understanding, which has three core computations: a
Morphological Filter that compares a kernel to an image, Region Selection based on the results of
filtering, and Feature Extraction that identifies features within the regions. The first of these was
hand-coded to use DIVA's WideWord unit. The second, which accounted for only a small amount
of the sequential computation, was performed on the host processor. The third was executed in
the DIVA PIMs. For Ray Tracing, we obtained good parallel speedups by replicating a small
object  database  on  each  PIM  and  performing  the  screen  pixel  computation  in  a  cyclic  fashion.  If 
instead  the  object  database  is  large  and  replication  is  not  feasible,  the  costs  of  frequent  irregular 
communication  would  dominate  performance. 

8. Emulator

8.1 Hardware

As part of the DIVA architecture development, an FPGA-based emulator was constructed to
provide an early platform for software development and demonstrations. This effort produced
two versions of hardware to track developments and requirements emerging from the
primary architecture effort.

The DIVA emulator is a single-board peripheral device designed to plug into a commercial
Linux PC system. It is based on commercial Xilinx Field-Programmable Gate Arrays (FPGAs)
and may be configured to support a wide variety of applications beyond the emulation of DIVA
processors. The emulator is designed to support rapid configuration as a DIVA PIM processor
for executing DIVA programs; however, it is also a general-purpose FPGA engine capable of
supporting a wide range of hardware modeling applications. Table 2 summarizes the hardware
features of the emulator.


Table 2. Emulator Hardware Features

As shown in Figure 19, a photograph of the first version of the DIVA emulator, the emulator circuit is
constructed on two printed circuit boards stacked to form a thin sandwich. The emulator meets
PCI physical size restrictions, even with components mounted on both sides of the two boards.


Figure 19, Photograph of emulator board

In addition to the FPGAs, DRAM and SRAM memories, and PCI bus interface ASIC
(Application Specific Integrated Circuit), the main emulator board also contains a small Atmel
microcontroller used for power control and FPGA thermal monitoring, and voltage regulation
circuits to supply the FPGAs with power. The Atmel microcontroller can communicate with the
host system via a "mailbox" in the PCI interface ASIC, enabling the host to issue commands for
power control and clock rate generation. Figure 20 depicts how the emulator board components
are interconnected, and can be used as a guide for partitioning new logic designs so they can best
fit the available resources.

The  on-board  power  regulation  circuit  delivers  1.8  VDC  and  2.5  VDC  to  the  FPGAs  and  other 
on-board  devices.  The  1.8  V  level  is  used  for  powering  the  FPGA  internal  circuits,  while  the  2.5 
V  rail  is  used  to  supply  power  to  the  input/output  pins  of  the  FPGAs,  memories,  and  PCI 
interface  ASIC. 

8.2  Software 

8.2.1  Linux  Driver 

The Linux driver for the emulator is written to be compatible with RedHat Linux v7. The driver
provides interrupt-handling code (not used in DIVA emulations) plus the basic services (device
open, read, write, etc.) required by application programs such as the user command program.

8.2.2  User  Command  Program 

The emulator user command program is a simple application that provides the user with a small
set of commands to control the emulator board. Table 3 briefly describes the commands
available to users.


Table 3, User control commands

COMMAND     ARGUMENTS        DESCRIPTION
HW Reset    -                Un-configures all FPGA logic (equivalent to system initialization)
SW Reset    -                Halts FPGA operation and initializes all user state machines
Load        <file name>      Loads FPGA configuration from user-specified file (Xilinx "bit file")
Power       #, 1/0           Selectively powers FPGA, DRAM/SRAM zone on/off
Clock       N={1 2 4 8 16}   Sets main FPGA clock to (40 MHz / N)
Run         -                Releases FPGAs to run current configuration
Stop        -                Halts current run
Step        -                Causes clock generator to issue one clock pulse (single step)


Figure 20, Schematic and photograph of emulator board interconnect details


8.2.3  Graphical  User  Interface 

Figure 21 shows the user command program display panel. The underlying text-only command
interface has been overlaid by a simple graphical interface that allows the user to control the
operation of the emulator, including single- or multiple-clock execution stepping, and a display
panel to report the contents of registers and memory locations within the emulated processor.


Figure 21, Emulator GUI

8.3 Edge Detect Demonstration

The emulator was used to demonstrate execution of a simple DIVA program for edge detection
(Sobel filtering) in a small (256x256 pixel) image. While simple in construction, this program
requires the execution of over two million DIVA instructions to complete. The photographs in
Figure 22 are typical of images used in the demonstration, and the corresponding results of edge
detection. Changing the threshold value used to determine the presence of an edge, or light/dark
transition, can reduce the amount of "clutter" visible in the result.

In this demonstration, the host system loaded the DIVA PIM program into memory (SRAM) on
the emulator card. The input image was loaded into PIM storage (DRAM) by the host system.
The emulator was directed to run the program, which used the original image in PIM storage to
generate results that were placed in another region of DRAM. When execution completed, the
host could read the results directly from PIM storage and display them in a window for viewing. The
edge detect program required approximately one second to execute.
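
For readers unfamiliar with the technique, the following is a minimal Python sketch of Sobel filtering with a threshold. It is not the DIVA program used in the demonstration; the function name, the use of |gx| + |gy| as the gradient magnitude, and the handling of border pixels are assumptions made for illustration.

```python
def sobel_edges(image, threshold):
    """Return a binary edge map (1 = edge) for a 2-D list of gray levels."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]          # border pixels are left at 0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Horizontal gradient: right column minus left column.
            gx = (image[y-1][x+1] + 2 * image[y][x+1] + image[y+1][x+1]
                  - image[y-1][x-1] - 2 * image[y][x-1] - image[y+1][x-1])
            # Vertical gradient: bottom row minus top row.
            gy = (image[y+1][x-1] + 2 * image[y+1][x] + image[y+1][x+1]
                  - image[y-1][x-1] - 2 * image[y-1][x] - image[y-1][x+1])
            # |gx| + |gy| approximates the gradient magnitude without a square root.
            if abs(gx) + abs(gy) > threshold:
                out[y][x] = 1
    return out
```

Raising `threshold` marks fewer pixels as edges, which corresponds to the reduction in clutter described above.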

8.4  Lessons  Learned 

Several  valuable  lessons  were  learned  during  the  development  of  the  emulator. 

8.4.1  Nominal  Clock  Rate  Isn’t 

According to Xilinx, the DIVA emulator was the first design to use the XCV1000 devices. It
soon became apparent that the FPGAs would not support the initial target of a 40-megahertz clock
speed: the FPGA wiring resources would not consistently propagate signals. In fact, Xilinx
provided special wiring paths to propagate critical signals over long distances within the FPGA.
Unfortunately, these wiring paths constituted less than ten percent of the available wiring
resources, requiring that every new FPGA design be hand placed and routed for efficiency. As a
result, the nominal clock rate of the emulator was reduced to ten megahertz.

8.4.2  Partitioning  Across  FPGAs  Is  A  Hard  Problem 

As the architecture evolved, it became apparent that a PIM processor with a full WideWord
datapath would not fit in a single XCV1000. This forced a large amount of effort to be expended
in partitioning the node across two FPGAs: one for the scalar (32-bit) datapath and the
instruction pipeline, one for the WideWord datapath.

Figure 23, PIM node architecture partitioning across two Xilinx FPGAs


Figure 23 shows how the PIM processor was partitioned across the emulator FPGAs and other
board-level resources. While the emulator effort was started at the beginning of the DIVA
effort, the evolving nature of the architecture made it very difficult to anticipate the eventual
logic requirements of the ASIC. The first version of the emulator was built with Virtex
XCV1000 devices, which claimed to deliver a capacity of one million logic gate equivalents. As
DIVA was originally conceived, this would have been more than adequate to configure a full
DIVA PIM processor; indeed, this was the reason four copies of the FPGA/DRAM/SRAM
cluster were implemented on a single board.

8.4.3  FPGA  Tools  Are  Not  Robust  (WideWord  Impact) 

After the scalar 32-bit processor was demonstrated with the edge-detect program, the WideWord
(256-bit) datapath design was begun. This design was simplified by the fact that the scalar
datapath could be replicated and modified to implement the variable word width features of the
WideWord instructions. This modified datapath was then copied eight times to produce the
WideWord logic. At this point, the design tools distributed by the FPGA
manufacturer, Xilinx, broke, and did so in unpredictable ways. Compilation runs would freeze,
abort at random points in the process, or refuse to begin. Incomplete runs would not
produce any output data, so it was essentially impossible to determine what aspect of the design
was causing the failure.

Although Xilinx responded to some of these errors with additional releases of software, we did
not receive the level of support required to work through these problems. It was decided that the
design of the WideWord unit would have to be further partitioned to get any design to complete.

8.4.4  Cycle  Accuracy  Requires  More  Clock  Cycles 

The  basic  operating  requirement  for  the  emulator  was  to  provide  cycle-accurate  results.  That  is, 
at  the  end  of  every  clock  cycle,  every  register  should  contain  correct  results.  This  requirement 
enabled  the  emulator  design  to  be  further  partitioned  so  that  the  WideWord  could  be  represented 
by  four  64-bit  datapaths,  each  executing  the  current  instruction  in  one  quarter  of  the  pipeline 
clock.  This  partitioning  drove  the  final  execution  speed  of  the  emulator  to  2.5  MHz,  which  is  still 
very  acceptable  when  compared  to  software  simulations.  Figure  24  depicts  the  basic  pipeline 
clock  partitioned  into  eight  microcycles. 


Figure  24,  Partitioning  of  clock  cycles  into  microcycles 

The colored bands illustrate how one pipeline clock can be divided into sixteen microcycles
should the need arise. The emulated DIVA PIM hardware executes WideWord instructions using
eight microcycles: four are used for the four 64-bit operations, and the remaining four are used
to guarantee safe data storage in the WideWord register file and to avoid bus conflicts when
making a selection among one of the four 64-bit data fields.
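
The time-multiplexing described above can be illustrated with a short Python sketch that performs one 256-bit add as four sequential 64-bit slice operations, each followed by a bookkeeping microcycle. The 32-bit field width, the interleaving of compute and storage microcycles, and the function names are assumptions for illustration; the emulator itself was implemented in FPGA logic.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1

def add_slice64(a, b):
    """Add two 64-bit slices as two independent 32-bit fields (assumed width)."""
    lo = ((a & MASK32) + (b & MASK32)) & MASK32
    hi = ((a >> 32) + (b >> 32)) & MASK32
    return (hi << 32) | lo

def wideword_add(a, b):
    """Emulate one 256-bit add as four sequential 64-bit operations."""
    result = 0
    microcycles = 0
    for s in range(4):
        a_s = (a >> (64 * s)) & MASK64
        b_s = (b >> (64 * s)) & MASK64
        result |= add_slice64(a_s, b_s) << (64 * s)
        microcycles += 1   # one microcycle for the 64-bit operation
        microcycles += 1   # one microcycle for safe register-file storage (assumed order)
    return result, microcycles
```

Because every slice completes before the pipeline clock ends, the 256-bit register holds a correct value at each pipeline clock boundary, which is the cycle-accuracy requirement stated in this section.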

9.  Prototype  System  Integration 

The goal of the prototype system was to produce a stable, high-bandwidth demonstration
platform for DIVA PIMs. In addition, it was to provide an environment in which to debug the
first PIM chips and monitor their performance.

The  demonstration  platform  required  several  areas  of  effort  including: 

• Host Node Board
• Host Peripheral I/O
• Host Operating System Code
• PIM-ulator
• Assembler & Linker
• PIM-Specific Code
• PIM SO-DIMM

9.1 Host Node Board

A custom PPC 603e-based node board, funded under the ASNT project, was used. It
contains an MPC106 combination memory controller and host bridge for PCI. Because the board was
designed at ISI, it allows straightforward modifications to both hardware and firmware for PIM operation.

9.2  Host  Peripheral  I/O 

The host node PCI port provides a method for off-the-shelf subsystems to be used for standard
I/O functions. An expansion CPCI (Compact Peripheral Component Interconnect) chassis and
Ethernet, video, SCSI, and serial I/O cards were purchased and checked out with the PPC 603e-based
PCs on hand from the ASNT project.

9.3  Host  Operating  System  Code 

Though the host node has only skeleton firmware, it was thought that Linux would be able to
boot when provided with a device tree. That was necessary but not sufficient. Each peripheral
may contain its own custom firmware that must be executed in a delicate interplay with the host
node firmware (either Open Firmware or BIOS compliant) in order for Linux (or any other
OS, for that matter) to be bootable. Progress was made toward hand-executing this interplay, but
in the end the pace was insufficient for the project needs. Per the PIM Specific Code section
below, a small OS called RTEMS was to be used for the PIM and was also pressed into service
for the host node. A port of RTEMS to the host node and its skeleton boot firmware
allowed TTY console communication in a matter of weeks. The port accomplished three
things: it provided experience with RTEMS in an easily debugged environment (the host node),
made the host node capable of controlling and performance monitoring the PIM, and finally
provided a reasonable operating system for the development of PIM memory management code.
It was used to great effect in the DARPATech 2002 demonstration of the host node and PIM
noted in the summary below.

9.4  PIM-ulator 

Concern over both the schedule and functionality of the first PIM chip, coupled with the
existence of unique hardware, led to the creation of the PIM-ulator. The ASNT Bridge node
hardware contained six powerful FPGA devices that allowed one host node to communicate with
another via external L2 cache cycles. In that way one host node could simulate the PIM
processor and memory while the other acted as a normal host node. This configuration allowed a
path for OS and memory management software and operational interaction between host and a
pseudo-PIM without the real PIM chip.

9.5  Assembler  and  Linker 

Open source tools from the GNU project have been in the plan from the project outset. The first
assembler for DIVA was a port pulled from the MIPS branch of the GNU assembler tree due to
similarities in the Instruction Set Architecture. It was used for the Emulator area of the project
described elsewhere in this document. The port required some 660 unique versions of 94 DIVA
instructions. The assembler and linker saw standalone use in the Emulator and then more
extensive and integrated use as the chip was brought up and tested.

High-level compiler support was desired for the wide-word chip functionality. The front end of
the compiler (gcc) was pulled from the PPC branch of the GNU tree due to the availability of PPC
AltiVec extensions. From this branch, the backend of the compiler was modified to produce
DIVA assembly mnemonics as input to the assembler. The two worlds of MIPS and PPC collided
as the gcc tool chain was used as a whole. The PPC-based backend was sufficiently incompatible
with the MIPS-based assembler to require a port of the MIPS rewrite to the PPC assembler base.
The compiler was then able to work from the DIVA-modified AltiVec extension front end,
through the DIVA-modified backend and finally out the DIVA-modified PPC assembler and
linker. This combination has seen much use in conjunction with the PIM hardware and the host
node system.

9.6  PIM  Specific  Code 

The PIM is to have multiple threads operational on the chip under control of the Run-Time
Kernel (RTK). Initially it was to be a custom in-house design, but as the intricacies of coherent
management of memory from the host node side and PIM node side became apparent, it was
decided to concentrate on those intricacies and use something off-the-shelf for the bulk of the
less novel details. RTEMS, a real-time operating system initially designed for mission-critical
guidance and control systems, was chosen for its capabilities, small footprint and open-source
status. It was ported and built for the PPC-based host node as mentioned above under Host
Operating System Code.

The memory management code was the target of much effort, leading to a paper published in the
Proceedings of the Workshop on Intelligent Memory Systems, held in conjunction with
Architectural Support for Programming Languages and Operating Systems in November 2000.
This PIM-specific code was implemented and simulated in a Linux
environment and is to be ported to RTEMS with only a moderate amount of expected effort.

9.7  PIM  SO-DIMM 

After the PIM chip passed initial functional test in a test board connected to a logic analyzer, the
design of a system memory board was finished and the board was fabricated. It consisted of two PIM chips on
an  SDRAM  SO-DIMM  form  factor  memory  board.  The  two  chips  may  be  interconnected  to  each 
other  or  to  other  PIMs  on  other  memory  boards.  Logically  this  interconnection  is  accomplished 
with  the  Parcel  buffer;  physically  it  is  with  ribbon  cables.  These  memory  cards  were  tested  out  in 
the  host  node  first  as  common  SDRAM  memory,  addressed  with  two  different  chip-selects  from 
the  memory  controller.  With  reliable  operation  of  the  memory  subsystem  the  focus  turned  to 
running  the  Cornerturn  stressmark  kernel  on  the  chip. 

9.8  DARPATech  Demonstration 

In the spring of 2002, ISI was invited to present a demonstration of DIVA PIM technology at the
DARPATech Symposium at the end of July. It provided an additional goal and focus during
those months. Ten packaged PIM chips were assembled onto five SO-DIMM memory module
boards, one shown in Figure 25.

Figure 25, SO-DIMM memory module board.


Within a week the memory interface to both chips was proven operational. The next business day
the Cornerturn stressmark code that was verified on the PIM test board was running at speed in
the PIM on the SO-DIMM inserted into the host node demonstration system, shown in Figure 26.


Figure  26,  Host  node  demonstration  system 

Many different aspects of the host node and PIM required attention and could have jeopardized
the demonstration. Memory tests that logged the number of errors and the location of the last error
were written for the PIM memory to ensure enough good memory space for the code and data. The host node
memory controller required parameters for the new memory since there is no host node Open
Firmware (BIOS). The host node SO-DIMM sockets were replaced with 22.5 degree sockets to
accommodate the oversized wing of the PCB that holds the PIM chips. Small clock and reset
modules were made to provide these functions to the host node when standing alone in a CPCI
cage. A chip reset line which enables operation from a reset vector was also wired to a pin set
aside for such on the memory socket, while the host node CPLD (Complex Programmable Logic
Device) was enhanced with register support of an I/O line wired to that pin for reset control.

The software provided another set of constraints. Our goal became to put the PowerPC host node
into a race with the PIM. RTEMS was used to manage this race on the host and at the same time
use task priorities to ensure full processing time was given to the host. The PIM Cornerturn
application was hand written and hand assembled, while the host Cornerturn was written in C
and automatically compiled and assembled with gcc and gas for the PowerPC with no
optimizations. The resultant assembly code was compared for similarity to the PIM code and was
within a few percent of the same cycle count.

The code and data were loaded through an emulator to both the PowerPC and the PIM memories.
The data size was 32k bytes, 8k 32-bit integer values. This data size was deliberately larger than
the PPC603e 16k byte data cache. Then under control of RTEMS via the PCI serial card the
demonstration was started. The host counted off 1000 iterations of Cornerturn. The PIM was left
running during that time. The PIM performed over 35,000 iterations, yielding a 35x speedup. The
clock speed of the 603e was 166 MHz while the PIM was 133 MHz. The numbers illustrate both
the large penalty for cache miss behavior on the host (~13 bus cycles @ 66 MHz for 205 ns) and
the large benefit of very low-latency (~3 cycles @ 133 MHz for 23 ns) access to main memory for
the PIM processor.
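
The latency figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only; the helper name is hypothetical, and because the cycle counts in the text are approximate, the computed values differ slightly from the quoted 205 ns and 23 ns.

```python
def latency_ns(cycles, clock_mhz):
    """Convert a cycle count at a given clock rate to nanoseconds."""
    return cycles * 1000.0 / clock_mhz

host_miss_ns = latency_ns(13, 66)     # host cache miss: about 197 ns
pim_access_ns = latency_ns(3, 133)    # PIM local access: about 22.6 ns
latency_ratio = host_miss_ns / pim_access_ns

# Iteration counts reported for the demonstration.
observed_speedup = 35000 / 1000
```

With these inputs, the per-access latency ratio comes out a little under 9, and the iteration counts give the reported 35x.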

9.9  Stressmark-on-Chip  Verification 

Continuing forward, we realized that many parts of the system required verification at once: the
chip, the system interface, the assembler, the compiler backend as well as the compiler. With a
small team and a plan for a second release of the chip with more features, we have adopted a test
strategy of using the DIS Stressmark suite with known inputs and outputs to give maximum
functional coverage with minimum effort. To that end, we have taken the C versions of
Cornerturn and Transitive Closure through the DIVA compiler and assembler. The kernel of the
stressmark is then extracted, setup code and the known input data are appended, and the code is run
on the chip. The outputs are then checked against known good output from gcc builds and runs
on a SPARC workstation.
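
The comparison step of this strategy can be sketched as below. This is a hedged Python illustration, assuming the corner turn kernel is a matrix transpose and that chip output has already been read back into a list of rows; the function names are hypothetical and do not describe the project's actual test scripts.

```python
def cornerturn_reference(matrix):
    """Reference corner turn (assumed to be a matrix transpose)."""
    return [list(col) for col in zip(*matrix)]

def check_outputs(device_output, reference_output):
    """Return (row, col, device value, reference value) for every mismatch."""
    mismatches = []
    for r, (dev_row, ref_row) in enumerate(zip(device_output, reference_output)):
        for c, (dev, ref) in enumerate(zip(dev_row, ref_row)):
            if dev != ref:
                mismatches.append((r, c, dev, ref))
    return mismatches
```

An empty mismatch list means the run agrees with the known good output; any entries localize the disagreement to specific elements, which helps separate chip, assembler and compiler faults.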

This method has turned up a handful of bugs in several different areas and is proving to be a
viable approach under the limited time constraints.

Recent verification work has shown successful execution of bi-directional message passing,
along with transitive closure, pointer, and 2-PIM transitive with integral chip-to-chip
communications.

9.10  Future  Work 

The integration effort as a whole is still paying dividends. The HPCS project is using the current
system to measure DIVA's performance on the StreamAdd benchmark and project expected
performance for the HPCS-sponsored Godiva system. The next chip turn incorporates a DDR
interface, mounted on full size DIMM memory cards plugged into a commodity Itanium-based
workstation as a test bench for the larger system concepts.

11. Professional Personnel

11.1  Research  Area  Leaders: 

• Dr. John J. Granacki, Principal Investigator

• Dr. Mary Hall, Co-Principal Investigator

• Dr. Jeffrey Draper, VLSI Team Leader

• Dr. Jacqueline Chame, Simulation and Applications Team Leader

• Mr. Jeffrey LaCoss, Emulator Team Leader

• Mr. Tim Barrett, System Integration Team Leader

11.2  Doctoral  students 

• Dr. Louis Luh, PhD, May 2000, Thesis Title: High-Speed CMOS Continuous-Time
Switched-Current Sigma-Delta Modulators

• Dr. Herming Chiueh, PhD, Aug 2002, Thesis Title: A Thermal Management Design for
System-on-Chip Circuits and Advanced Computer Systems

• Dr. Yuyu Chang, PhD, September 2002, Thesis Title: CMOS Giga-Hertz Band Filters
with Automatic Tuning Circuitry for Communication Applications

• Joong-Seok Moon, PhD expected Aug 2003

• Jaewook Shin (PIM-specific optimizations, integration of DIVA scalar GCC and
AltiVec-extended GCC, integration of DIVA compiler with MIT-SLP, DIVA
implementations of Cornerturn, Field and NAS CG, DIVA simulator library
implementation components), PhD expected 2004

• Chun Chen (DIVA implementations of Neighborhood stressmark, Image Understanding
and Ray Tracing benchmarks, port of simulator to Condor)

• Hang Shi (DIVA implementations of Transitive stressmark, Method of Moments
benchmark)

• Ruoming Pang (DIVA implementations of Natural Join and OO7 benchmarks)

• Chang Woo Kang, PhD TBD

• Ihn Kim, PhD TBD

• Taek-Jun Kwon, PhD TBD

• Sumit Mediratta, PhD TBD

11.3  Masters  students 

• Somphol Boonjing (Application Binary Interface for node compiler), MS December 2000

• Sachit Chandra (VLSI), MS expected Aug 2003

• Gokhan Daglikoca (VLSI)

• Prashant Desai (design for assembler, backend compiler, integration with WideWord
instructions), MS December 2000

• Rommel Dongre (GCC scalar backend implementation), MS December 2001

• Yamini Kaur (VLSI)

• Junaid Qazi (VLSI)

• Shyam Sethuram (DIVA simulator implementation components), MS May 2002

• Vijay Srinivasan, MS December 2003

11.4  Other  Collaborators 

• USC/ISI: Mr. Dale Chase, Mr. Jeff Sondeen, Dr. Bill Athas, Dr. Jeff Koller, Dr. Craig
Steele, Mr. Mike Gorman, Dr. Apporv Srivastava, Ms. Diane Delute, Mr. Bert White, Dr.
Pedro Diniz, Mr. Pablo Moissett

• Caltech: Dr. Thomas Sterling, Mr. Daniel Savarese

• University of Notre Dame: Dr. Peter Kogge, Dr. Jay Brockman, Dr. Vincent Freeh, Mr.
Bedros Hanouik, Mr. Richard Murphy, Mr. Rich Kendall, Mr. Alexi Koundraiov, Ms.
Shannon Kuntz, Mr. Jason Zawodny, Mr. Arun Rodrigues, Mr. Edward Kang

• University of Delaware: Dr. Guang Gao, Dr. Kevin Theobald, Mr. Tom Geiger

• AlphaTech: Dr. Mark Luettgen, Dr. Bob Tenney

12.  Results,  Conclusions  &  Technology  Transfer 

The single most important result produced by the DIVA Project is a complete working system
that demonstrates the advantages of PIM technology used as "smart memories". This is the
proof of concept that "smart memory" can help ameliorate the "memory wall" that limits the
performance of present day memory systems.

This achievement paves the way for further research on systems with heterogeneous memory
systems, that is, PIM and conventional DRAM used together; PIM-based memory hierarchies,
for example, PIM caches; studying and evaluating larger application problems, namely, those
that cannot be run on a simulator or emulator; and combining this technology in new ways or
incorporating it with other technology into new architectures.

Two follow-on research projects have already started to build on the DIVA technology:
MONARCH under the DARPA-sponsored Polymorphous Computer Architecture Program and
Godiva under the High Productivity Computing System Program. Perhaps of even greater
significance, these new projects are expanding the research and extending the technology in
partnership with large industrial partners. MONARCH is a joint project with the Raytheon
Corporation, a leading defense contractor, and Mercury Computing, the largest supplier of
embedded computers to the military. Godiva is a joint project with Hewlett Packard, a major U.S.
computer vendor. Both of these projects represent significant IP transfer from DIVA but also
represent a high possibility for insertion of DIVA technology into real military and commercial
systems.

A "second turn" of the DIVA VLSI funded under the MONARCH Project will also incorporate
a floating-point unit into the WideWord unit, greatly enhancing DIVA's applicability to a broader
class of scientific problems.

The DIVA team has briefed many of the research leaders of major U.S. companies such as IBM,
Intel, Hewlett Packard and Sun, as well as several venture capitalists who have expressed an
interest in DIVA technology. We have also briefed the Deputy Under Secretary of Defense for
Science and Technology, NSA’s Director of Computing and the DOE’s ASCI Program Manager.
We will continue to inform the decision makers about this technology.

The main issue with the acceptance of PIM and embedded-DRAM technology is cost-performance,
that is, whether the benefit of combining DRAM on the same chip with the logic
processing warrants the added expense of manufacturing these dies. This is a complex question
that depends on the specific application and also on the semiconductor technology. At this time,
there is definitely a premium to be paid for the added performance offered by systems that use
PIM technology.

13.  Inventions,  or  patent  disclosures 

No  inventions  were  disclosed  or  patents  submitted  by  the  USC  DIVA  research  team. 
Appendix A: DIVA PIM Processor ISA

Chapter 1 - DIVA Instruction Set Overview


Scalar Instruction Formats


As shown in Figure 1, the DIVA scalar instruction uses a three-operand format to specify two 32-bit source registers and a 32-bit target
register. For arithmetic/logical instructions using this format, there is also a C bit to indicate whether the current instruction updates
condition codes. However, the C bit indicates signed/unsigned arithmetic for multiply/divide instructions, since these instructions never
update condition codes by definition. In lieu of a second source register, a 16-bit immediate value may be specified, as shown in Figure 2.

Figure 1 Format R for Scalar Register Operations [legible field label: function]

Figure 2 Format I for Scalar Immediate Operations [legible field labels: rD, rA, immediate]
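
As an illustration of the two scalar formats, the following Python sketch packs and unpacks instruction words. It is not part of the DIVA tool chain. The bit positions are taken from the field diagrams in the instruction descriptions of Chapter 2 (big-endian labeling, bit 0 most significant), and the ADD and ADDI opcode values come from Table 2; the function names `encode_r`, `decode_r`, and `encode_i` are hypothetical.

```python
def encode_r(opcode, rd, ra, rb, c, function):
    """Pack a Format R word. Bit 0 is the most significant bit, so the
    opcode (bits 0-5) occupies the top six bits of the 32-bit word."""
    fields = (("opcode", opcode, 6), ("rd", rd, 5), ("ra", ra, 5),
              ("rb", rb, 5), ("c", c, 1), ("function", function, 6))
    for name, value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
    # Bits 22-25 are don't-care in Format R and are left as zero here.
    return (opcode << 26) | (rd << 21) | (ra << 16) | (rb << 11) | (c << 10) | function


def decode_r(word):
    """Split a Format R word back into its named fields."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rd": (word >> 21) & 0x1F,
        "ra": (word >> 16) & 0x1F,
        "rb": (word >> 11) & 0x1F,
        "c": (word >> 10) & 0x1,
        "function": word & 0x3F,
    }


def encode_i(opcode, rd, ra, imm16):
    """Pack a Format I word; the immediate is truncated to 16 bits."""
    return (opcode << 26) | (rd << 21) | (ra << 16) | (imm16 & 0xFFFF)


# addc r1, r2, r3 (opcode 000011, C = 1, function 100000)
word = encode_r(0b000011, 1, 2, 3, 1, 0b100000)
```

Decoding `word` with `decode_r` returns the same six field values, which is a quick way to check that the shifts agree with the bit ranges in the diagrams.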


The branch instruction formats are shown in Figure 3. The branch target address may be PC-relative or calculated using a base register ORed
with an offset. In both formats, the offset is in units of words, or 4 bytes, since instructions must be on a 4-byte boundary. Furthermore, the
L bit specifies linkage, that is, whether a return instruction address should be saved in R31, referred to as a call instruction. Also, the CCC
field specifies one of eight branch conditions: always, equal, not equal, less than, less than or equal, greater than, greater than or equal, or
overflow. See the branch and call instruction descriptions for details.

Figure 3 Format B for Branches

WideWord Instruction Formats


As shown in Figure 4, “WideWord Arithmetic/Logical Format,” WideWord instructions follow the general form of scalar instructions.
Additional control information is included to manage the data fields of the WideWord, and to modify the execution of the instruction. Figure 5
shows the format for transfers within the WideWord register file and across the scalar and WideWord register files.

Figure 4 Format W for WideWord Arithmetic/Logical Operations

Figure 5 Format F for WideWord and Inter-Register File Transfers


The  control  fields  are  defined  as  follows: 

WW (width)

The WW field sets the width of the WideWord operands to eight, sixteen, or thirty-two bits, which primarily affects the shift
operations and the configuration of the carry chain for additions and subtractions. For the merge instruction, these bits specify
the condition on which the merge is based. The encoding of these bits is listed in the following table:


WW Value    Operand Width    Assembler Mnemonic
00          8 bits           b
01          16 bits          h
10          32 bits          w
11          Reserved         NA

C (condition code enable)

The C bit indicates whether condition codes will be updated as a result of the current instruction's execution. However, the C
bit indicates signed/unsigned arithmetic for multiply, pack, and unpack instructions.

PP (participation)

The PP field interacts with condition codes to control whether a computation is performed on a given data field. The
participation field can specify that a data field participate always, only if a condition local to its own data field is true, only if
the data field is the leftmost field with a condition that is true, or only if the data field is the rightmost field with a condition that
is true. The condition that is inspected for participation depends on the value of the PM (participation mode) register. Refer to
the architecture document for more details. The encoding of the PP bits is listed in the following table:


PP Value    Participation Definition        Assembler Mnemonic
00          Always participate              a
01          Specified by local condition    c
10          Leftmost participation          l
11          Rightmost participation         r

T (type)

The T bit governs whether the current instruction operates on a vector or scalar. Depending on the function, rD or rA may
specify a WideWord register. In this case, the T bit specifies whether the current transfer instruction refers to the WideWord
register as a whole vector or instead uses the index field to index a sub-field of the WideWord register.

Index field

Value to be used as an index when a sub-field of a WideWord is involved in a transfer. Depending on the function, this index
field may be an immediate or a scalar GPR specifier. Also, the index field may be coupled with either rD or rA depending on the
direction of the transfer as specified by the function.

Condition Codes

The scalar condition code register, CC, consists of 5 bits. The first three bits of CC are set by an algebraic comparison of the result to zero;
the other two bits have slightly more peculiar semantics. The condition codes have the CC bit labels and semantics as indicated below. Note
that LT, GT, EQ, and CA condition codes are updated only if the current instruction has its condition code enable bit set. The OV condition
code is updated for any scalar add or subtract operation, regardless of the condition code enable bit setting, and is sticky, that is, it is only
cleared when the condition code register is read.


Condition Code    CC bit    Description
LT                0         This bit is set when the result is negative.
GT                1         This bit is set when the result is positive and non-zero.
EQ                2         This bit is set when the result is zero.
OV                3         This bit is set to indicate overflow has occurred during execution of an add
                            or subtract instruction. This bit is not altered by any other instructions. In
                            practice, the OV bit is set if the carry out of bit 0 is not equal to the carry out
                            of bit 1 (assuming big-endian bit labeling).
CA                4         In general, the carry bit (CA) is set to indicate that a carry out of bit 0
                            occurred during execution of an add or subtract instruction. This bit is not
                            altered by any other instructions.
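
The condition code definitions above can be checked with a small model. The following Python sketch is an illustration only, not the DIVA implementation: it derives LT, GT, EQ, OV, and CA for a single 32-bit addition, using the rule that OV is set when the carry out of bit 0 differs from the carry out of bit 1 (bit 0 being the most significant bit). The name `add_with_flags` is hypothetical, and neither the sticky behaviour of OV nor the condition code enable bit is modelled.

```python
MASK32 = 0xFFFFFFFF


def add_with_flags(a, b, carry_in=0):
    """Add two 32-bit values and derive the five scalar condition codes
    for this one operation (stickiness of OV is not modelled)."""
    a &= MASK32
    b &= MASK32
    total = a + b + carry_in
    result = total & MASK32
    # Carry out of bit 0, the most significant bit in big-endian labeling.
    carry_out_bit0 = (total >> 32) & 1
    # Carry out of bit 1 is the carry into bit 0: add only the low 31 bits.
    carry_out_bit1 = (((a & 0x7FFFFFFF) + (b & 0x7FFFFFFF) + carry_in) >> 31) & 1
    negative = (result >> 31) & 1
    flags = {
        "LT": negative == 1,
        "GT": negative == 0 and result != 0,
        "EQ": result == 0,
        "OV": carry_out_bit0 != carry_out_bit1,
        "CA": carry_out_bit0 == 1,
    }
    return result, flags


# Largest positive value plus one overflows into the sign bit.
overflow_result, overflow_flags = add_with_flags(0x7FFFFFFF, 1)
# All ones plus one wraps to zero with a carry out but no signed overflow.
wrap_result, wrap_flags = add_with_flags(0xFFFFFFFF, 1)
```

The two calls at the end separate the cases that OV and CA are meant to distinguish: signed overflow without a carry out, and a carry out without signed overflow.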

The 32-bit LT, GT, EQ, OV, and CA registers of the WideWord datapath have analogous semantics to the corresponding condition code of the
scalar datapath. For instance, each bit of the WideWord LT register is set if the result of its corresponding 8-bit datapath is negative.
However, there are subtleties due to the configurability of the operand sizes. For example, if a WideWord instruction specifies that operands are
to be treated as 32-bit values, the condition codes are grouped into eight groups of 4, where each bit of a group is updated with the same
value to reflect a condition for the group's corresponding 32-bit result.
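
The grouping rule can be sketched as follows. This Python fragment is a model under stated assumptions, not a description of the hardware: it assumes a 256-bit WideWord made of thirty-two 8-bit datapaths, assumes that bit i of the 32-bit LT register corresponds to datapath i with bit 0 leftmost, and assumes big-endian byte order within each operand. The function name `wideword_lt` is hypothetical.

```python
def wideword_lt(data: bytes, width_bits: int) -> int:
    """Return a 32-bit LT register value for a 32-byte WideWord result.

    Every 8-bit datapath that belongs to a negative operand gets its LT
    bit set, so a 32-bit operand sets or clears a group of four bits.
    """
    if len(data) != 32:
        raise ValueError("a WideWord is modelled here as 32 bytes")
    if width_bits not in (8, 16, 32):
        raise ValueError("operand width must be 8, 16, or 32 bits")
    lanes_per_field = width_bits // 8
    lt = 0
    for start in range(0, 32, lanes_per_field):
        # The sign bit of the operand is the top bit of its first byte.
        negative = data[start] >> 7
        for lane in range(start, start + lanes_per_field):
            # Bit 0 of the register is the most significant bit.
            lt |= negative << (31 - lane)
    return lt


# Only the leftmost byte has its top bit set.
sample = bytes([0x80] + [0] * 31)
```

With `sample`, the same data yields one, two, or four LT bits depending on whether the operands are treated as 8-bit, 16-bit, or 32-bit values.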

Similar to condition codes, the WideWord floating-point status register (FPSR - special-purpose register 15) may be updated to reflect
exception conditions for floating-point operations. This register is a 32-bit register arranged in groups of 4 status conditions for each of the eight
32-bit floating-point units in the WideWord datapath. The 4 status conditions are: divide by zero (DZ), invalid (IV), inexact (IX), and
unsupported value (UV). DZ, IV, and IX are typical IEEE-754 floating-point exceptions. Refer to the IEEE-754 standard for details. UV indicates
that either overflow or underflow occurred at some point during the program. All bits of FPSR are sticky; once set, they remain set until
FPSR is read via an mfspr instruction. The bit arrangement for FPSR is shown below.


FPSR Bit Arrangement

Concise List


TABLE 1. DIVA Instruction Set


FUNC          DESCRIPTION

[illegible]   System Call
ICLI          Instruction Cache Line Invalidate
RFE           Return from Exception
MTSPR         Move to special-purpose reg
MFSPR         Move from special-purpose reg
MTPR          Move to protected reg
MFPR          Move from protected reg
MTATR         Move to address translation reg
MFATR         Move from address translation reg

Bx            Branch on scalar condition
BAx           Branch on all WideWord conditions
BNx           Branch on no WideWord condition
CALLx         Call on scalar condition
CALLAx        Call on all WideWord conditions
CALLNx        Call on no WideWord condition

Scalar Instructions

ADD           Add
ADDE          Add extended
ADDI          Add immediate
ADDIC         Add immediate w/ condition codes
SUB           Subtract
SUBE          Subtract extended
SUBU          Subtract unsigned
MUL           Multiply
MULU          Multiply unsigned
DIV           Divide
DIVU          Divide unsigned
AND           And
ANDI          And immediate
ANDIC         And immediate w/ condition codes
NOT           Bitwise inversion
OR            Or
ORI           Or immediate
ORIC          Or immediate w/ condition codes
ORIS          Or immediate shifted
XOR           Xor
XORI          Xor immediate
XORIC         Xor immediate w/ condition codes
SLL           Shift left logical
SLLI          Shift left logical immediate
SRA           Shift right arithmetic
SRAI          Shift right arithmetic immediate
SRL           Shift right logical
SRLI          Shift right logical immediate
LD            Load Reg from Mem
ST            Store Reg to Mem

WideWord Instructions

WADD          Add
WADDE         Add extended
WSUB          Subtract
WSUBE         Subtract extended
WSUBU         Subtract unsigned
WMULES        Multiply even signed
WMULEU        Multiply even unsigned
WMULOS        Multiply odd signed
WMULOU        Multiply odd unsigned
WAND          And
WNOT          Bitwise inversion
WOR           Or
WXOR          Xor
WSLL          Shift left logical
WSLLI         Shift left logical immediate
WSRA          Shift right arithmetic
WSRAI         Shift right arithmetic immediate
WSRL          Shift right logical
WSRLI         Shift right logical immediate
WLD           Load Reg from Mem
WST           Store Reg to Mem
WFABS         Floating-point absolute value
WFADD         Floating-point add
WFDIV         Floating-point divide
WFMUL         Floating-point multiply
WFNEG         Floating-point negate
WFSUB         Floating-point subtract
[illegible]   Floating-point to integer conversion
[illegible]   Integer to floating-point conversion

Special WideWord Instructions

WPRM          Permute
WPRMI         Permute immediate
WMRC          Merge based on condition codes
WPKS          Pack using signed arithmetic
WPKU          Pack using unsigned arithmetic
WUPKH         Unpack high-order byte/halfword
WUPKL         Unpack low-order byte/halfword

Transfer Instructions

MVSW          Move scalar to WW
MVSWI         Move scalar to WW, indirect
MVWS          Move WW to scalar
MVWSI         Move WW to scalar, indirect
MVWW          Move WW to WW
MVWWI         Move WW to WW, indirect

Miscellaneous Instructions

LOKL          Lock Load
LOKS          Lock Store
PROBE         Probe address to determine locality
ELO           Encode leftmost one
CLO           Clear leftmost one
Alphabetical list of instructions


TABLE 2. Preliminary Encoding of DIVA Instruction Set


                        Encoding
Instruction   Format    6 bits    5 bits    5 bits    5 bits    5 bits    6 bits

ADD           R         000011    rD        rA        rB        0XXXX     100000
ADDC          R         000011    rD        rA        rB        1XXXX     100000
ADDE          R         000011    rD        rA        rB        0XXXX     100001
ADDEC         R         000011    rD        rA        rB        1XXXX     100001
ADDI          I         100000    rD        rA        immediate
ADDIC         I         100001    rD        rA        immediate
AND           R         000011    rD        rA        rB        0XXXX     101000
ANDC          R         000011    rD        rA        rB        1XXXX     101000
ANDI          I         101000    rD        rA        immediate
ANDIC         I         101001    rD        rA        immediate
Bx            B         111111    00CCC     rA        offset
Bx            B         111111    10CCC     PC-relative offset
BAx           B         111100    00CCC     rA        offset
BAx           B         111100    10CCC     PC-relative offset
BNx           B         111101    00CCC     rA        offset
BNx           B         111101    10CCC     PC-relative offset
CALLx         B         111111    01CCC     rA        offset
CALLx         B         111111    11CCC     PC-relative offset
CALLAx        B         111100    01CCC     rA        offset
CALLAx        B         111100    11CCC     PC-relative offset
CALLNx        B         111101    01CCC     rA        offset
CALLNx        B         111101    11CCC     PC-relative offset
CLO           R         000011    rD        rA        00000     0XXXX     001001
DIV           R         000011    00000     rA        rB        0XXXX     100111
DIVU          R         000011    00000     rA        rB        1XXXX     100111
ELO           R         000011    rD        rA        00000     0XXXX     001000
ICLI          I         110011    00000     rA        offset
LD            I         110000    rD        rA        offset
LOKL          I         110110    rD        rA        offset
LOKS          I         110111    rD        rA        offset
MFATR         R         000000    rD        atrA      00000     XXXXX     000010
MFPR          R         000000    rD        prA       00000     XXXXX     000000
MFSPR         R         000001    rD        sprA      00000     XXXXX     000100
MTATR         R         000000    atrD      rA        00000     XXXXX     000011
MTPR          R         000000    prD       rA        00000     XXXXX     000001
MTSPR         R         000001    sprD      rA        00000     XXXXX     000101
MUL           R         000011    00000     rA        rB        0XXXX     100110
MULU          R         000011    00000     rA        rB        1XXXX     100110

TABLE 2. Preliminary Encoding of DIVA Instruction Set (continued)

[Most rows on the continuation pages of Table 2 are illegible in the source. The legible fragments show Format R rows with opcode
000011, Format W rows with opcodes 011101 and 000010, and the four rows below.]

                        Encoding
Instruction   Format    6 bits    5 bits    5 bits    5 bits    5 bits    6 bits

WADDE         W         000010    wrD       wrA       wrB       0PPWW     100001
[illegible]   W         000010    wrD       wrA       wrB       1PPWW     100001
WAND          W         000010    wrD       wrA       wrB       0PPWW     101000
[illegible]   W         000010    wrD       wrA       wrB       1PPWW     101000
TABLE 3. Special-Purpose Registers

[Table contents are missing from the source.]


TABLE 4. Protected Registers


NAME          PR Number    DESCRIPTION

[illegible]   0            32-bit processor status word
SSW           1            [illegible]
[illegible]   2            16-bit environment identifier register
IADR          3            32-bit address of faulting instruction (stored value of PC)
[illegible]   4-7          32-bit supervisor scratch registers
ESW           8            [illegible]
EMR           9            32-bit exception mask register
ESR           10           32-bit exception set register
ERR           11           32-bit exception reset register
MADR          12           32-bit faulting memory address
TIMER         13           32-bit programmable delay timer
RCL           14           Low order 32 bits of real-time clock
RCH           15           High order 32 bits of real-time clock


TABLE 5. Address Translation Registers


NAME          ATR Number   DESCRIPTION

[illegible]   [illegible]  segment base registers
[illegible]   [illegible]  local segment limit registers
[illegible]   [illegible]  segment virtual base registers
[illegible]   [illegible]  32-bit global segment limit registers
[illegible]   [illegible]  32-bit global segment physical base registers


Chapter 2 - Instruction Descriptions


Notation


This chapter gives detailed individual instruction descriptions. We use Big-Endian byte and bit labeling, meaning that bit/byte 0 is the most
significant. Other conventions are listed in the table below.


TABLE 6. Instruction Glossary


Symbol        Meaning

A <- B        Assignment
A || B        Bit string concatenation
x^y           x replicated y times
x_{y:z}       Selection of bits y through z from x
x ∧ y         x bitwise ANDed with y
x ∨ y         x bitwise ORed with y
x ⊕ y         x bitwise exclusive ORed with y
¬x            Bitwise inversion of x
MEM[EA]       Memory contents at effective address EA
0xvalue       Hexadecimal value
0bvalue       Binary value
frX           Floating-point register X
(rX)          Contents of general-purpose register X
PC            Program counter
IADR          Instruction address

Note that the IADR of an instruction is equivalent to the PC value while the instruction is in the fetch stage of the pipeline.


Precedence


The following table gives the rules of precedence and associativity for the pseudocode operators. All operators on the same line have equal
precedence, and all operators on a given line have higher precedence than those on the lines below them.


TABLE 7. Precedence of Pseudocode Operators


Operator                  Associativity

x[n]                      left to right
[illegible]               left to right
x^y                       left to right
[illegible]               right to left
×, ÷                      left to right
+, -                      left to right
[illegible]               left to right
!=, <, <=, >, >=          left to right
⊕, ∧                      left to right
∨                         left to right
[illegible]               none
addx - Add


Scalar Unit

add   rD, rA, rB  (C = 0)

addc  rD, rA, rB  (C = 1)


000011 | rD   | rA    | rB    | C  | xxxx  | 100000
0-5    | 6-10 | 11-15 | 16-20 | 21 | 22-25 | 26-31


rD <- (rA) + (rB)

The sum (rA) + (rB) is placed into rD.

Other registers altered:

•  If C = 1, scalar condition code registers: LT, GT, EQ, CA

•  Scalar condition code OV is set if the operation causes overflow.
addex - Add Extended


Scalar Unit

adde   rD, rA, rB  (C = 0)

addec  rD, rA, rB  (C = 1)


000011 | rD   | rA    | rB    | C  | xxxx  | 100001
0-5    | 6-10 | 11-15 | 16-20 | 21 | 22-25 | 26-31


rD <- (rA) + (rB) + CA

The sum (rA) + (rB), using the carry bit CA as the carry in, is placed into rD.

Other registers altered:

•  If C = 1, scalar condition code registers: LT, GT, EQ, CA

•  Scalar condition code OV is set if the operation causes overflow.
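
One common use of an add-extended instruction is multi-word arithmetic. The Python sketch below models a 64-bit addition built from two 32-bit steps, as an illustration rather than as code from the DIVA project: the low words are added first with the carry recorded (which on DIVA would require the C = 1 form, addc, since CA is only updated when the condition code enable bit is set), and the high words are then added with CA as the carry in, as adde does. The helper names `add32` and `add64` are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def add32(a, b, carry_in=0):
    """Model one 32-bit add: return (result, carry out of bit 0)."""
    total = (a & MASK32) + (b & MASK32) + carry_in
    return total & MASK32, total >> 32


def add64(a_hi, a_lo, b_hi, b_lo):
    """Add two 64-bit values held as (high word, low word) pairs."""
    lo, ca = add32(a_lo, b_lo)       # like addc: low words, CA recorded
    hi, _ = add32(a_hi, b_hi, ca)    # like adde: high words plus CA
    return hi, lo


carry_case = add64(0x00000000, 0xFFFFFFFF, 0x00000000, 0x00000001)
```

In `carry_case` the low words wrap to zero and the carry propagates into the high word, which is exactly the case that a plain add of the high words would get wrong.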
addi - Add Immediate


Scalar Unit

addi  rD, rA, IMM


100000 | rD   | rA    | IMM
0-5    | 6-10 | 11-15 | 16-31


The sum (rA) + IMM (sign-extended to form a 32-bit value) is placed into rD.

Other registers altered:

•  Scalar condition code OV is set if the operation causes overflow.
addic - Add Immediate Recording Condition Codes

Scalar Unit

addic  rD, rA, IMM


100001 | rD   | rA    | IMM
0-5    | 6-10 | 11-15 | 16-31


The sum (rA) + IMM (sign-extended to form a 32-bit value) is placed into rD.

Other registers altered:

•  Scalar condition code registers: LT, GT, EQ, CA

•  Scalar condition code OV is set if the operation causes overflow.


andx - AND


Scalar Unit

and   rD, rA, rB  (C = 0)

andc  rD, rA, rB  (C = 1)


000011 | rD   | rA    | rB    | C  | xxxx  | 101000
0-5    | 6-10 | 11-15 | 16-20 | 21 | 22-25 | 26-31


rD <- (rA) ∧ (rB)

The contents of rA are ANDed with rB, and the result is placed into rD.

Other registers altered:

•  If C = 1, scalar condition code registers: LT, GT, EQ
andi - AND Immediate


Scalar Unit

andi  rD, rA, IMM


101000 | rD   | rA    | IMM
0-5    | 6-10 | 11-15 | 16-31


The contents of rA are ANDed with IMM (prepended with zeros to form a 32-bit value), and the result is placed into rD.

Other registers altered:

•  None
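
The immediate forms differ in how the 16-bit IMM field is widened: addi sign-extends it, while andi prepends zeros. The following Python sketch contrasts the two under that reading of the descriptions above. It is a model for illustration, the helper names are hypothetical, and condition code updates are not modelled.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(imm):
    """Replicate bit 0 of the 16-bit field (its sign bit) into the upper 16 bits."""
    imm &= 0xFFFF
    return (imm | 0xFFFF0000) if (imm & 0x8000) else imm


def zero_extend16(imm):
    """Prepend 16 zero bits to the 16-bit field."""
    return imm & 0xFFFF


def addi(ra, imm):
    """Model of addi: (rA) plus the sign-extended immediate, truncated to 32 bits."""
    return (ra + sign_extend16(imm)) & MASK32


def andi(ra, imm):
    """Model of andi: (rA) ANDed with the zero-extended immediate."""
    return ra & zero_extend16(imm)
```

For example, an immediate of 0xFFFF acts as -1 in `addi` but as the mask 0x0000FFFF in `andi`, so `andi` always clears the upper half of the register.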
andic - AND Immediate Recording Condition Codes


Scalar Unit

andic  rD, rA, IMM


101001 | rD   | rA    | IMM
0-5    | 6-10 | 11-15 | 16-31


rD <- (rA) ∧ (0^16 || IMM)

The contents of rA are ANDed with IMM (prepended with zeros to form a 32-bit value), and the result is placed into rD.

Other registers altered:

•  Scalar condition code registers: LT, GT, EQ
bx - Branch


bx  rA, offset  (register-relative format)


111111 | 0 | 0 | CCC  | rA    | offset
0-5    | 6 | 7 | 8-10 | 11-15 | 16-31


bx  offset  (PC-relative format)


111111 | 1 | 0 | CCC  | offset
0-5    | 6 | 7 | 8-10 | 11-31


if scalar condition indicated by CCC
    if PC-relative format
        PC <- IADR + ((offset_0)^9 || offset || 00)
    else
        PC <- ((rA) ∧ 0xFFFFFFFC) ∨ ((offset_0)^14 || offset || 00)


This branch instruction is conditional upon the scalar condition specified by CCC. For the register-relative format, the target address is
formed by ORing the offset with the contents of rA. For the PC-relative format, the target address is formed by adding the offset to the
instruction address. In both cases, the offset is considered to be a signed instruction count, so it is shifted left two bits and sign-extended.
Furthermore, the two least significant bits of the contents of rA are ignored in the register-relative format so that a proper instruction-aligned
address results. The next instruction is always executed (one delay slot).


CCC    Register-Relative Mnemonic    PC-Relative Mnemonic

000    b rA, offset                  b offset
001    beq rA, offset                beq offset
010    bne rA, offset                bne offset
011    blt rA, offset                blt offset
100    ble rA, offset                ble offset
101    bgt rA, offset                bgt offset
110    bge rA, offset                bge offset
111    bov rA, offset                bov offset

Other registers altered:
•  None


The ret instruction is a simplified mnemonic for b r31, 0.
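
The two target-address calculations can be sketched in Python as follows. This is a model of the description above under stated assumptions, not DIVA source code: the PC-relative offset is taken to be the 21-bit field in bits 11-31, the register-relative offset the 16-bit field in bits 16-31, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def pc_relative_target(iadr, offset21):
    """Target = IADR + (sign-extended offset shifted left two bits)."""
    return (iadr + (sign_extend(offset21, 21) << 2)) & MASK32


def register_relative_target(ra, offset16):
    """Target = ((rA) with its low two bits cleared) ORed with the
    sign-extended offset shifted left two bits."""
    base = ra & 0xFFFFFFFC
    return (base | ((sign_extend(offset16, 16) << 2) & MASK32)) & MASK32
```

Note that the register-relative form ORs rather than adds, so it behaves like base plus offset only when the set bits of the shifted offset do not overlap set bits of the base. With an offset of 0, as in `ret`, the target is simply r31 with its two low bits cleared.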
bax - Branch on All


bax  rA, offset  (register-relative format)


111100 | 0 | 0 | CCC  | rA    | offset
0-5    | 6 | 7 | 8-10 | 11-15 | 16-31


bax  offset  (PC-relative format)


111100 | 1 | 0 | CCC  | offset
0-5    | 6 | 7 | 8-10 | 11-31


if condition indicated by CCC is true for all WideWord datapaths
    if PC-relative format
        PC <- IADR + ((offset_0)^9 || offset || 00)
    else
        PC <- ((rA) ∧ 0xFFFFFFFC) ∨ ((offset_0)^14 || offset || 00)


This conditional branch instruction succeeds if the condition specified by CCC is true for all WideWord datapaths. For the register-relative
format, the target address is formed by ORing the offset with the contents of rA. For the PC-relative format, the target address is formed by
adding the offset to the instruction address. In both cases, the offset is considered to be a signed instruction count, so it is shifted left two bits
and sign-extended. Furthermore, the two least significant bits of the contents of rA are ignored in the register-relative format so that a proper
instruction-aligned address results. The next instruction is always executed (one delay slot).


CCC    Register-Relative Mnemonic    PC-Relative Mnemonic

000    b rA, offset                  b offset
001    baeq rA, offset               baeq offset
010    bane rA, offset               bane offset
011    balt rA, offset               balt offset
100    bale rA, offset               bale offset
101    bagt rA, offset               bagt offset
110    bage rA, offset               bage offset
111    baov rA, offset               baov offset

Other registers altered:
•  None
bnx - Branch on None


bnx  rA, offset  (register-relative format)


111101 | 0 | 0 | CCC  | rA    | offset
0-5    | 6 | 7 | 8-10 | 11-15 | 16-31


bnx  offset  (PC-relative format)


111101 | 1 | 0 | CCC  | offset
0-5    | 6 | 7 | 8-10 | 11-31


if condition indicated by CCC is false for all WideWord datapaths
    if PC-relative format
        PC <- IADR + ((offset_0)^9 || offset || 00)
    else
        PC <- ((rA) ∧ 0xFFFFFFFC) ∨ ((offset_0)^14 || offset || 00)


This conditional branch instruction succeeds if the condition specified by CCC is false for all WideWord datapaths. For the register-relative
format, the target address is formed by ORing the offset with the contents of rA. For the PC-relative format, the target address is formed by
adding the offset to the instruction address. In both cases, the offset is considered to be a signed instruction count, so it is shifted left two bits
and sign-extended. Furthermore, the two least significant bits of the contents of rA are ignored in the register-relative format so that a proper
instruction-aligned address results. The next instruction is always executed (one delay slot).


CCC    Register-Relative Mnemonic    PC-Relative Mnemonic

000    b rA, offset                  b offset
001    bneq rA, offset               bneq offset
010    bnne rA, offset               bnne offset
011    bnlt rA, offset               bnlt offset
100    bnle rA, offset               bnle offset
101    bngt rA, offset               bngt offset
110    bnge rA, offset               bnge offset
111    bnov rA, offset               bnov offset

Other registers altered:
•  None
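
The "all" and "none" branch conditions reduce a vector of per-datapath conditions to a single decision. The Python sketch below models that reduction, as an illustration only. It assumes that the condition selected by CCC is available as a 32-bit vector with one truth bit per 8-bit WideWord datapath; how that vector is derived from the LT, GT, EQ, and OV registers for each CCC value is not modelled, and the function names are hypothetical.

```python
ALL_DATAPATHS = 0xFFFFFFFF


def branch_on_all_taken(condition_bits):
    """bax succeeds only if the condition holds in every datapath."""
    return (condition_bits & ALL_DATAPATHS) == ALL_DATAPATHS


def branch_on_none_taken(condition_bits):
    """bnx succeeds only if the condition holds in no datapath."""
    return (condition_bits & ALL_DATAPATHS) == 0


# A mixed vector: the condition holds in some datapaths but not others.
mixed = 0x0000F000
```

For a mixed vector such as `mixed`, neither predicate is true, so "branch on none" is not simply the negation of "branch on all".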
callx - Call


callx  rA, offset  (register-relative format)


111111 | 0 | 1 | CCC  | rA    | offset
0-5    | 6 | 7 | 8-10 | 11-15 | 16-31


callx  offset  (PC-relative format)


111111 | 1 | 1 | CCC  | offset
0-5    | 6 | 7 | 8-10 | 11-31


if scalar condition indicated by CCC
    r31 <- IADR + 8
    if PC-relative format
        PC <- IADR + ((offset_0)^9 || offset || 00)
    else
        PC <- ((rA) ∧ 0xFFFFFFFC) ∨ ((offset_0)^14 || offset || 00)


This call instruction is conditional upon the scalar condition specified by CCC. For the register-relative format, the target address is formed
by ORing the offset with the contents of rA. For the PC-relative format, the target address is formed by adding the offset to the instruction
address. In both cases, the offset is considered to be a signed instruction count, so it is shifted left two bits and sign-extended. Furthermore,
the two least significant bits of the contents of rA are ignored in the register-relative format so that a proper instruction-aligned address
results. The next instruction is always executed (one delay slot). The effective address of the instruction following the delay slot is placed
into r31.


CCC    Register-Relative Mnemonic    PC-Relative Mnemonic

000    call rA, offset               call offset
001    calleq rA, offset             calleq offset
010    callne rA, offset             callne offset
011    calllt rA, offset             calllt offset
100    callle rA, offset             callle offset
101    callgt rA, offset             callgt offset
110    callge rA, offset             callge offset
111    callov rA, offset             callov offset

Other registers altered:
•  None
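
The return address is IADR + 8 rather than IADR + 4 because the instruction in the delay slot at IADR + 4 executes before control transfers. The Python sketch below traces the fetch order for a call, as an illustration of the pseudocode above. The function name `simulate_call` and its return shape are hypothetical, and the target address is passed in rather than computed from the instruction fields.

```python
MASK32 = 0xFFFFFFFF


def simulate_call(iadr, target, condition_true=True):
    """Return (r31, addresses executed) for a call located at iadr.

    r31 is None when the condition fails, since the link register is
    only written inside the conditional part of the pseudocode.
    """
    executed = [iadr, (iadr + 4) & MASK32]   # the call, then its delay slot
    if condition_true:
        r31 = (iadr + 8) & MASK32            # instruction after the delay slot
        executed.append(target)
    else:
        r31 = None
        executed.append((iadr + 8) & MASK32)
    return r31, executed


taken = simulate_call(0x100, 0x400)
not_taken = simulate_call(0x100, 0x400, condition_true=False)
```

A later `ret` (b r31, 0) from the callee resumes at 0x108 in the `taken` case, so the delay-slot instruction at 0x104 is executed once, before the callee runs.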
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calla.v-  Call  on  All 


callaA'  rA,  offset  (register-relative  format) 


II 1 100 

0 

□ 

CCC 

rA 

olfsct 

0  5678  10  II  15  16  31 


calbuv  offset  (PC-relative  format) 


IlllOO 

0 

□ 

CCC 

olfsct 

0  5 

6 

7 

8  10 

11  31 

if condition indicated by CCC is true for all WideWord datapaths
    r31 ← IADR + 8
    if PC-relative format
        PC ← IADR + ((offset_0)^{9} || offset || 00)
    else
        PC ← ((rA) ∧ 0xFFFFFFFC) ∨ ((offset_0)^{14} || offset || 00)


This conditional call instruction succeeds if the condition specified by CCC is true for all WideWord datapaths. For the register-relative
format, the target address is formed by ORing the offset with the contents of rA. For the PC-relative format, the target address is formed by
adding the offset to the instruction address. In both cases, the offset is considered to be a signed instruction count, so it is shifted left two bits
and sign-extended. Furthermore, the least two significant bits of the contents of rA are ignored in the register-relative format so that a proper
instruction-aligned address results. The next instruction is always executed (one delay slot). The effective address of the instruction
following the delay slot is placed into r31.
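
The target-address rules above can be sketched in Python. This is only an illustrative model, not part of the DIVA tool chain: the function names are hypothetical, the condition test itself is not modeled, and the field widths (21-bit offset for the PC-relative format, 16-bit offset for the register-relative format) are taken from the bit positions shown in the instruction formats.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def call_target(iadr, offset, r_a=None):
    """Return (new_pc, r31) for a call that succeeds.

    r_a is None  -> PC-relative format (21-bit offset field)
    r_a given    -> register-relative format (16-bit offset field)
    """
    link = (iadr + 8) & MASK32  # address of the instruction after the delay slot
    if r_a is None:
        disp = sign_extend(offset, 21) << 2  # signed instruction count -> bytes
        return (iadr + disp) & MASK32, link
    disp = (sign_extend(offset, 16) << 2) & MASK32
    # The low two bits of rA are ignored, then the offset is ORed in.
    return ((r_a & 0xFFFFFFFC) | disp) & MASK32, link
```

For example, a PC-relative call at address 0x1000 with an offset of 4 instructions transfers control to 0x1010 and leaves 0x1008 in r31.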


CCC   Register-Relative Mnemonic   PC-Relative Mnemonic
000   call rA, offset              call offset
001   callaeq rA, offset           callaeq offset
010   callane rA, offset           callane offset
011   callalt rA, offset           callalt offset
100   callale rA, offset           callale offset
101   callagt rA, offset           callagt offset
110   callage rA, offset           callage offset
111   callaov rA, offset           callaov offset

Other registers altered:

• None
callnx - Call on None


callnx rA, offset (register-relative format)


II 1 101 

0 

□ 

CCC 

rA 

otlfsct 

0  5 

6 

7 

8  10 

11  15 

16  31 

callnx offset (PC-relative format)


IIIIOI 

0 

0 

CCC 

oll'sct 

0 

5 

6 

7 

8  10  II 

31 

if condition indicated by CCC is false for all WideWord datapaths
    r31 ← IADR + 8
    if PC-relative format
        PC ← IADR + ((offset_0)^{9} || offset || 00)
    else
        PC ← ((rA) ∧ 0xFFFFFFFC) ∨ ((offset_0)^{14} || offset || 00)


This conditional call instruction succeeds if the condition specified by CCC is false for all WideWord datapaths. For the register-relative
format, the target address is formed by ORing the offset with the contents of rA. For the PC-relative format, the target address is formed by
adding the offset to the instruction address. In both cases, the offset is considered to be a signed instruction count, so it is shifted left two bits
and sign-extended. Furthermore, the least two significant bits of the contents of rA are ignored in the register-relative format so that a proper
instruction-aligned address results. The next instruction is always executed (one delay slot). The effective address of the instruction
following the delay slot is placed into r31.


CCC   Register-Relative Mnemonic   PC-Relative Mnemonic
000   call rA, offset              call offset
001   callneq rA, offset           callneq offset
010   callnne rA, offset           callnne offset
011   callnlt rA, offset           callnlt offset
100   callnle rA, offset           callnle offset
101   callngt rA, offset           callngt offset
110   callnge rA, offset           callnge offset
111   callnov rA, offset           callnov offset

Other registers altered:

• None
clox - Clear Leftmost One


Scalar Unit

clo rD, rA (C = 0)

cloc rD, rA (C = 1)


0000 II 

rD 

rA 

00000 

0 

><c 

00 1001 

0  56  10  II  15  16  20  21  22  25  26  31 

for i = 31 to 0
    if (rA)_i
        tmp ← i

rD ← (rA) ∧ (1^{tmp} || 0 || 1^{31−tmp})

The contents of rA are searched to find the leftmost bit that is a one. The resulting value of clearing this bit but retaining the other bits is then
stored in rD.

Other registers altered:

• If C = 1, scalar condition code registers: LT, GT, EQ
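
As a rough illustration of the clo semantics, the following Python sketch clears the most significant set bit of a 32-bit value. The function name is hypothetical, and the result for an all-zero input is an assumption (the text above does not say what happens when no bit is set); condition-code recording for cloc is not modeled.

```python
def clo(r_a):
    """Clear the leftmost (most significant) one bit of a 32-bit value."""
    r_a &= 0xFFFFFFFF
    if r_a == 0:
        return 0  # assumption: nothing to clear, value is returned unchanged
    leftmost = r_a.bit_length() - 1  # position counted from the least significant bit
    return r_a & ~(1 << leftmost)
```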
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div  -  Divide 


Scalar  Unit 

div  rA,  rB 


(XtOOII 

00000 

rA 

rB 

0 

X 

loom 

0  56  10  II  15  16  20  21  22  25  26  31 


HI ← (rA) ÷ (rB)

LO ← (rA) mod (rB)

The contents of rA are divided by the contents of rB, treating both operands as signed values. No condition codes are updated as a result of
this operation. When the operation completes, the quotient word is loaded into special register HI, and the remainder word is loaded into
special register LO. This operation requires 38 clock cycles in the worst case and thus requires some amount of scheduling.

Other  registers  altered: 

•  None 
divu - Divide Unsigned

Scalar Unit

divu rA, rB


0000 1 1 


00000 


rA 


rli 


1001 1 


5  6 


10 


15  16 


20  21  22 


25  26 


HI ← (rA) ÷ (rB)

LO ← (rA) mod (rB)

The contents of rA are divided by the contents of rB, treating both operands as unsigned values. No condition codes are updated as a result
of this operation. When the operation completes, the quotient word is loaded into special register HI, and the remainder word is loaded into
special register LO. This operation requires 38 clock cycles in the worst case and thus requires some amount of scheduling.

Other  registers  altered: 

•  None 
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elo  -  Encode  Leftmost  One 


Scalar  Unit 

elo  rD,  rA 


0000  n 

rD 

rA 

00000 

0 

001000 

0  56  10  II  15  16  20  21  22  25  26 


tmp ← 0xFFFFFFFF
for i = 31 to 0
    if (rA)_i
        tmp ← i
rD ← tmp

The contents of rA are searched to find the leftmost bit that is a one. The index of this bit is then stored in rD. If no bit of the contents of rA
is a one, the value 0xFFFFFFFF is stored in rD.

Other  registers  altered: 

•  None 
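
A minimal Python sketch of elo follows. It assumes that bit 0 is the most significant bit, which is inferred from the loop direction in the pseudo-code and from the bit numbering of the instruction formats; the function name is hypothetical.

```python
def elo(r_a):
    """Return the index of the leftmost one bit, with bit 0 as the MSB."""
    r_a &= 0xFFFFFFFF
    if r_a == 0:
        return 0xFFFFFFFF  # no bit set, as stated in the description
    return 32 - r_a.bit_length()
```

Under this numbering, a value whose only set bit is the sign bit encodes to 0, and a value of 1 encodes to 31.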
icli - Instruction Cache Line Invalidate


icli rA, offset


noon 

00000 

rA 

otTset 

0  56  10  II  15  16 


EA ← (rA) + ((offset_0)^{16} || offset)

The 16-bit offset is sign-extended and added to the contents of rA to form the effective address EA. If the EA is contained in the instruction
cache, the cache line containing that address is invalidated. This instruction may be executed only in supervisor mode.

Other  registers  altered: 

•  None 
ld - Load General-Purpose Register


Scalar Unit

ld rD, rA, offset


1 10000 

rD 

rA 

otfset 

0  56  10  II  15  16  31 


EA ← 0xFFFFFFFC ∧ ((rA) + ((offset_0)^{16} || offset))

rD ← MEM[EA]

The 16-bit offset is sign-extended and added to the contents of rA to form the effective address EA. The 32-bit word at the memory location
specified by EA (ignoring the least two significant bits to ensure a 32-bit aligned address) is then loaded into rD.

Other  registers  altered: 

•  None 
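
The effective-address calculation shared by the load and store instructions can be sketched as follows. The function names and the dictionary used to stand in for memory are illustrative assumptions only.

```python
def effective_address(r_a, offset):
    """Sign-extend a 16-bit offset, add it to rA, and force word alignment."""
    offset &= 0xFFFF
    if offset & 0x8000:
        offset -= 0x10000  # sign extension of the 16-bit field
    return ((r_a + offset) & 0xFFFFFFFF) & 0xFFFFFFFC


def load_word(memory, r_a, offset):
    """Model of a word load; `memory` maps aligned addresses to 32-bit words."""
    return memory[effective_address(r_a, offset)]
```

For instance, a base of 0x1003 with an offset of -1 yields the aligned address 0x1000, because the two low-order bits of the sum are ignored.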
lokl - Lock Load


Scalar Unit

lokl rD, rA, offset


EIOEIO 

rO 

rA 

offset 

0  56  10  II  15  16 


EA ← 0xFFFFFFFC ∧ ((rA) + ((offset_0)^{16} || offset))

rD ← MEM[EA]
LOCK ← 1

The 16-bit offset is sign-extended and added to the contents of rA to form the effective address EA. The 32-bit word at the memory location
specified by EA (ignoring the least two significant bits to ensure a 32-bit aligned address) is then loaded into rD. The hardware lock bit is
also set and remains set until a loks instruction is executed or an exception occurs.

Other  registers  altered: 

•  None 




loks  -  Lock  Store 


Scalar  Unit 

loks  rD,  rA,  offset 


IIOIII 

rD 

rA 

OlTset 

0  56  10  II  15  16 


EA ← 0xFFFFFFFC ∧ ((rA) + ((offset_0)^{16} || offset))
if (LOCK = 1)
    MEM[EA] ← rD

LOCK ← 0

The 16-bit offset is sign-extended and added to the contents of rA to form the effective address EA. The 32-bit word contents of rD are
conditionally stored at the memory location specified by EA (ignoring the least two significant bits to ensure a 32-bit aligned address). The
success or failure of the store operation is indicated by the contents of rD after execution of the instruction. If an exception occurs between
the last lokl and this loks instruction, the store is inhibited from taking place and the loks fails. The operation of loks is undefined when the
address is different from the address used in the last lokl.

Other  registers  altered: 

•  None 
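
The interaction between lokl, loks, and the hardware lock bit can be modeled with a small Python class. This is a sketch under stated assumptions: the class and method names are hypothetical, memory is a plain dictionary, and success is reported as a Boolean return value because the text does not give the exact encoding written back into rD.

```python
class LockUnit:
    """Toy model of the lock bit used by lokl and loks."""

    def __init__(self, memory):
        self.memory = memory  # maps word-aligned addresses to 32-bit words
        self.lock = False

    def lokl(self, ea):
        """Load a word and set the lock bit."""
        self.lock = True
        return self.memory[ea & 0xFFFFFFFC]

    def exception(self):
        """Any exception clears the lock bit."""
        self.lock = False

    def loks(self, ea, value):
        """Store only if the lock bit is still set; always clear the lock."""
        succeeded = self.lock
        if succeeded:
            self.memory[ea & 0xFFFFFFFC] = value & 0xFFFFFFFF
        self.lock = False
        return succeeded
```

A typical use is a read-modify-write loop that repeats the lokl and loks pair until the store succeeds.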
mfatr - Move from Address Translation Register

Scalar Unit

mfatr rD, atrA


000000 

rD 

atrA 

00000 

0 

000010 

0  56  10  11  15  16  20  21  22  25  26  31 


rD ← (atrA)

The contents of address translation register atrA are stored in rD. A list of the address translation registers and their encoding is found in
Table 5. This instruction may be executed only in supervisor mode.

Other  registers  altered: 

•  None 




mfpr - Move from Protected Register


Scalar Unit

mfpr rD, prA


000000 

rD 

prA 

00000 

0 

XT 

000000 

0  56  10  II  15  16  20  21  22  25  26  31 


rD ← (prA)

The contents of protected register prA are stored in rD. A list of the protected registers and their encoding is found in Table 4. This
instruction may be executed only in supervisor mode.

Other  registers  altered: 

•  None 
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mfspr  -  Move  from  Special-Purpose  Register 


Scalar  Unit 

mfspr  rD,  sprA 


000001 

rD 

sprA 

00000 

0 

><c 

000100 

0  56  10  II  15  16  20  21  22  25  26 


rD ← (sprA)

The contents of special-purpose register sprA are stored in rD. A list of the special-purpose registers and their encoding is found in Table 3.
Other  registers  altered: 

•  None 
mtatr - Move to Address Translation Register

Scalar Unit

mtatr atrD, rA


000000 

atrD 

rA 

00000 

0 

><c 

0000 II 

0  56  10  II  15  16  20  21  22  25  26 


atrD ← (rA)

The contents of general-purpose register rA are stored in address translation register atrD. A list of the address translation registers and their
encoding is found in Table 5. This instruction may be executed only in supervisor mode.

Other  registers  altered: 

•  None 
mtpr - Move to Protected Register


Scalar Unit

mtpr prD, rA


000000 

prD 

rA 

00000 

0 

X 

000001 

0  56  10  II  15  16  20  21  22  25  26  31 


prD ← (rA)

The contents of general-purpose register rA are stored in protected register prD. A list of the protected registers and their encoding is found
in Table 4. This instruction may be executed only in supervisor mode.

Other  registers  altered: 

•  None 
mtspr - Move to Special-Purpose Register


Scalar Unit

mtspr sprD, rA

c 

00000 1  sprD 

rA 

00000  0 

OOOIOI 

0 

56  10  II 

15  16  20  21  22 

25  26  3 1 

sprD ← (rA)

The contents of general-purpose register rA are stored in special-purpose register sprD. A list of the special-purpose registers and their
encoding is found in Table 3.

Other  registers  altered: 

•  None 
mul - Multiply


Scalar Unit

mul rA, rB


000011 

00000 

rA 

rB 

0 

X 

lOOIlO 

0  56  10  II  15  16  20  21  22  25  26  31 


HI ← ((rA) × (rB))_{0..31}
LO ← ((rA) × (rB))_{32..63}

The contents of rA are multiplied by the contents of rB, treating both operands as signed values. No condition codes are updated as a result
of this operation. When the operation completes, the low-order word of the double result is loaded into special register LO, and the
high-order word is loaded into special register HI. This operation requires 4 clock cycles and thus requires some amount of scheduling.

Other  registers  altered: 

•  None 
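
The split of the 64-bit product into HI and LO can be illustrated with the following Python sketch, which covers both the signed and the unsigned multiply. The function names are hypothetical and the multi-cycle latency is not modeled.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value):
    """Reinterpret a 32-bit pattern as a signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def mul(r_a, r_b):
    """Signed multiply: return (HI, LO) of the 64-bit product."""
    product = (to_signed32(r_a) * to_signed32(r_b)) & 0xFFFFFFFFFFFFFFFF
    return (product >> 32) & MASK32, product & MASK32


def mulu(r_a, r_b):
    """Unsigned multiply: return (HI, LO) of the 64-bit product."""
    product = (r_a & MASK32) * (r_b & MASK32)
    return (product >> 32) & MASK32, product & MASK32
```

The same bit pattern 0xFFFFFFFF gives different HI words in the two cases, since it is read as -1 by the signed form and as 4294967295 by the unsigned form.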
mulu - Multiply Unsigned


Scalar Unit

mulu rA, rB


000011 


00000 


rA 


rB 


1001 10 


5  6 


10 


15  16 


20  21  22 


25  26 


31 


HI ← ((rA) × (rB))_{0..31}
LO ← ((rA) × (rB))_{32..63}

The contents of rA are multiplied by the contents of rB, treating both operands as unsigned values. No condition codes are updated as a result
of this operation. When the operation completes, the low-order word of the double result is loaded into special register LO, and the
high-order word is loaded into special register HI. This operation requires 4 clock cycles and thus requires some amount of scheduling.

Other  registers  altered: 

•  None 
mvswx - Move from Scalar to WideWord


mvsww wrD, rA, index (T = 0)
mvswrpw wrD, rA (T = 1)

000100 

wrD 

rA 

index 

B 

PPWVV 

000100 
Variable values in the following equations are as follows:


WW Value   size   mask
00         8      0b11111
01         16     0b11110
10         32     0b11100

base ← index ∧ mask
if (T = 0)
    (wrD)_{base×8 .. (base×8)+(size−1)} ← (rA)_{(32−size) .. 31}
else
    for i = 0 to (256 − size) by size
        (wrD)_{i .. i+(size−1)} ← (rA)_{(32−size) .. 31}

If T=0, some portion or all of the contents of rA are transferred to a subfield of wrD, starting at the byte specified by the byte index.
Depending on the size of the data to be transferred, the least significant bits of the index may be ignored to ensure proper alignment. If T=1, the
contents of rA are replicated to form a 256-bit value which is transferred to wrD, subject to the participation mode specified by PP.

Other  registers  altered: 

•  None 
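
A Python sketch of the two forms follows, with a WideWord register held as a 256-bit integer. It assumes that bit 0 and byte 0 are the most significant, as the bit numbering of the formats suggests. The function names are hypothetical and the participation mode (PP) is not modeled.

```python
# WW field value -> (size in bits, index mask), as in the table above
SIZES = {0b00: (8, 0b11111), 0b01: (16, 0b11110), 0b10: (32, 0b11100)}


def mvsw(wr_d, r_a, index, ww):
    """T = 0: insert the low-order `size` bits of rA at an aligned byte index."""
    size, mask = SIZES[ww]
    base = index & mask                    # drop low index bits for alignment
    field = r_a & ((1 << size) - 1)
    shift = 256 - (base * 8 + size)        # bit 0 is the most significant bit
    cleared = wr_d & ~(((1 << size) - 1) << shift)
    return cleared | (field << shift)


def mvswr(r_a, ww):
    """T = 1: replicate the low-order `size` bits of rA across all 256 bits."""
    size, _ = SIZES[ww]
    field = r_a & ((1 << size) - 1)
    result = 0
    for _ in range(256 // size):
        result = (result << size) | field
    return result
```

With a 16-bit transfer and a byte index of 3, the index is rounded down to 2, so the halfword lands in bytes 2 and 3 of the WideWord register.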
mvswi - Move from Scalar to WideWord Indirect


mvswiw wrD, rA, rB


000100 

wrD 

rA 

rB 

0 

ooww 

lOOlOO 

0  56  10  II  15  16  20  21  22  25  26 


Variable values in the following equations are as follows:


WW Value   size   mask
00         8      0b11111
01         16     0b11110
10         32     0b11100

base ← (rB)_{27..31} ∧ mask
(wrD)_{base×8 .. (base×8)+(size−1)} ← (rA)_{(32−size) .. 31}

Some portion or all of the contents of rA are transferred to a subfield of wrD, starting at the byte specified by the low-order bit contents of
rB. Depending on the size of the data to be transferred, the least significant bits of the contents of rB may be ignored to ensure proper
alignment.

Other  registers  altered: 

•  None 
mvws - Move from WideWord to Scalar


mvwsw rD, wrA, index


Variable values in the following equations are as follows:


base ← index ∧ mask


(rD)_{(32−size) .. 31} ← (wrA)_{base×8 .. (base×8)+(size−1)}

if (size != 32)

    (rD)_{0 .. (32−size−1)} ← 0^{(32−size)}


A subfield of the contents of wrA starting at the byte specified by the byte index is transferred to rD. Depending on the size of the data to
be transferred, the least significant bits of the index may be ignored to ensure proper alignment. For data sizes less than 32 bits, the
high-order bits of rD are cleared.


Other  registers  altered: 


•  None 


mvwsi - Move from WideWord to Scalar Indirect
mvwsiw rD, wrA, rB


Variable values in the following equations are as follows:


base ← (rB)_{27..31} ∧ mask


(rD)_{(32−size) .. 31} ← (wrA)_{base×8 .. (base×8)+(size−1)}

if (size != 32)

    (rD)_{0 .. (32−size−1)} ← 0^{(32−size)}


A subfield of the contents of wrA starting at the byte specified by the low-order bits of the contents of rB is transferred to rD. Depending
on the size of the data to be transferred, the least significant bits of the contents of rB may be ignored to ensure proper alignment. For data
sizes less than 32 bits, the high-order bits of rD are cleared.


Other registers altered:


•  None 
mvwwx - Move from WideWord to WideWord


mvwwp wrD, wrA (T = 0)

mvwwrpw wrD, wrA, index (T = 1)


000100 

wrD 

wr.>\ 

index 

B 

PPWVV 

000000 

0  56  10  II  15  16  20  21  22  25  26 


Variable values in the following equations are as follows:


WW Value   size   mask
00         8      0b11111
01         16     0b11110
10         32     0b11100

base ← index ∧ mask
if (T = 0)
    wrD ← (wrA)
else
    for i = 0 to (256 − size) by size
        (wrD)_{i .. i+(size−1)} ← (wrA)_{base×8 .. (base×8)+(size−1)}

If T=0, the entire 256-bit contents of wrA are transferred to wrD, subject to the participation mode specified by PP. If T=1, the subfield of
wrA starting at the byte specified by the byte index and of the size indicated by the WW bits is replicated to form a 256-bit value which is
transferred to wrD, subject to the participation mode specified by PP. Depending on the size of the data to be transferred, the least significant
bits of the index may be ignored to ensure proper alignment.

Other  registers  altered: 

•  None 
mvwwir - Move from WideWord to WideWord Indirect Replicating


mvwwirpw wrD, wrA, rB


000100 

wrD 

wrA 

rB 

□ 

PPWW 

100000 
Variable values in the following equations are as follows:


WW Value   size   mask
00         8      0b11111
01         16     0b11110
10         32     0b11100

base ← (rB)_{27..31} ∧ mask
for i = 0 to (256 − size) by size
    (wrD)_{i .. i+(size−1)} ← (wrA)_{base×8 .. (base×8)+(size−1)}

The subfield of wrA starting at the byte specified by the low-order bits of the contents of rB and of the size indicated by the WW bits is
replicated to form a 256-bit value which is transferred to wrD, subject to the participation mode specified by PP. Depending on the size of the
data to be transferred, the least significant bits of the contents of rB may be ignored to ensure proper alignment.

Other  registers  altered: 

•  None 
notx - NOT


Scalar Unit

not rD, rA (C = 0)

notc rD, rA (C = 1)


0000 II 

rD 

rA 

00000 

0 

X 

101110 

0  5  6  10  1 1  15  16  20  21  22  25  26  31 


rD ← ¬(rA)

The bitwise inversion of the contents of rA is placed into rD.
Other registers altered:

• If C = 1, scalar condition code registers: LT, GT, EQ
orx - OR


Scalar Unit

or rD, rA, rB (C = 0)

orc rD, rA, rB (C = 1)


000011 

rD 

rA 

rB 

0 

X 

lOllOO 

0  56  10  II  15  16  20  21  22  25  26  31 


rD ← (rA) ∨ (rB)

The contents of rA are ORed with rB, and the result is placed into rD.
Other registers altered:

• If C = 1, scalar condition code registers: LT, GT, EQ
ori - OR Immediate


Scalar Unit

ori rD, rA, IMM


lOllOO 

rD 

rA 

IMM 

0  56  10  II  15  16  31 


rD ← (rA) ∨ (0^{16} || IMM)

The contents of rA are ORed with IMM (prepended with zeros to form a 32-bit value), and the result is placed into rD.
Other  registers  altered: 

•  None 
oric - OR Immediate Recording Condition Codes

Scalar Unit

oric rD, rA, IMM


lOIIOI 

rD 

rA 

IMM 

0  56  10  II  15  16  31 


The contents of rA are ORed with IMM (prepended with zeros to form a 32-bit value), and the result is placed into rD.
Other registers altered:

• Scalar condition code registers: LT, GT, EQ
oris - OR Immediate Shifted


Scalar Unit

oris rD, rA, IMM


lOIIIO 

rD 

rA 

IMM 

0  56  10  II  15  16  31 


rD ← (rA) ∨ (IMM || 0^{16})

The contents of rA are ORed with IMM (appended with zeros to form a 32-bit value), and the result is placed into rD.
Other  registers  altered: 

•  None 
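
The difference between the two immediate forms can be shown with a short Python sketch. The function names are hypothetical, and the helper that builds a full 32-bit constant from an oris and ori pair is only one plausible use of the two instructions, not something the manual prescribes.

```python
def ori(r_a, imm):
    """OR rA with a 16-bit immediate placed in the low-order halfword."""
    return (r_a | (imm & 0xFFFF)) & 0xFFFFFFFF


def oris(r_a, imm):
    """OR rA with a 16-bit immediate placed in the high-order halfword."""
    return (r_a | ((imm & 0xFFFF) << 16)) & 0xFFFFFFFF


def load_constant(value):
    """Build a 32-bit constant starting from a zero register."""
    reg = oris(0, value >> 16)        # upper halfword first
    return ori(reg, value & 0xFFFF)   # then the lower halfword
```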
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probe  -  Probe  Address 


Scalar  Unit 

probe  rD,  rA,  offset 


1 10010 

rD 

rA 

oElsct 

0  5  6  10  1 1  15  16  31 


EA ← (rA) + ((offset_0)^{16} || offset)
if EA is locally mapped
    rD ← 0xFFFFFFFF
else
    rD ← 0x00000000

The 16-bit offset is sign-extended and added to the contents of rA to form the effective address EA. The effective address is then forwarded
to the address translation hardware to determine if the address is a valid local address. The success or failure of the operation is indicated by
the contents of rD after execution of the instruction.

Other  registers  altered: 

•  None 




rfe - Return from Exception


rfe 


0 


56  10  II  15  16  20  21  25  26  31 


PC ← (IADR)


The program counter, PC, is loaded with the contents of the protected register IADR. Similarly, the PSW is loaded with the contents of SSW.
The next instruction is always executed (one delay slot).

Other  registers  altered: 

•  None 
sllx - Shift Left Logical


Scalar  Unit 

sll  rD,  rA,  rB  (C  =  0) 

sllc  rD,  rA,  rB  (C  =  1) 


000011 

rD 

rA 

rB 

0 

X 

000000 

0  56  10  II  15  16  20  21  22  25  26 


s ← (rB)_{27..31}
rD ← (rA)_{s..31} || 0^{s}

The contents of rA are shifted left by the number of bits specified by the low-order five bits of the contents of rB, inserting zeros into
the low-order bits of the result. The result is placed into rD.

Other registers altered:

• If C = 1, scalar condition code registers: LT, GT, EQ
sllix - Shift Left Logical Immediate

Scalar Unit

slli rD, rA, shift_amount (C = 0)

sllic rD, rA, shift_amount (C = 1)


0000 II 

iD 

rA 

shift  amount 

0 

X 

000010 

0  56  10  II  15  16  20  21  22  25  26  31 


s ← shift_amount

The contents of rA are shifted left by shift_amount bits, inserting zeros into the low-order bits of the result. The result is placed into rD.
Other registers altered:

• If C = 1, scalar condition code registers: LT, GT, EQ
srax - Shift Right Arithmetic


Scalar  Unit 

sra  rD,  rA,  rB  (C  =  0) 

srac  rD,  rA,  rB  (C  =  1) 


0000 II 

rD 

rA 

rB 

0 

X 

OOOIOI 

0  56  10  II  15  16  20  21  22  25  26  31 


s ← (rB)_{27..31}

The contents of rA are shifted right by the number of bits specified by the low-order five bits of the contents of rB, sign-extending the
high-order bits of the result. The result is placed into rD.

Other registers altered:

• If C = 1, scalar condition code registers: LT, GT, EQ
sraix - Shift Right Arithmetic Immediate

Scalar Unit

srai rD, rA, shift_amount (C = 0)

sraic rD, rA, shift_amount (C = 1)


0000 II 

rD 

rA 

shift  amount 

0 

X 

000111 

0  56  10  II  15  16  20  21  22  25  26  31 


s ← shift_amount

The contents of rA are shifted right by shift_amount bits, sign-extending the high-order bits of the result. The result is placed into rD.
Other registers altered:

• If C = 1, scalar condition code registers: LT, GT, EQ
srlx - Shift Right Logical


Scalar Unit

srl rD, rA, rB (C = 0)

srlc rD, rA, rB (C = 1)


00001 1 

rD 

rA 

rB 

0 

X 

000001 

0  5 6  10 11  15 16  20 21 22  25 26  31


s ← (rB)_{27..31}

The contents of rA are shifted right by the number of bits specified by the low-order five bits of the contents of rB, inserting zeros into
the high-order bits of the result. The result is placed into rD.

Other registers altered:

• If C = 1, scalar condition code registers: LT, GT, EQ
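
The three register-controlled shifts described so far differ only in what is shifted in. The following Python sketch, with hypothetical function names, models them on 32-bit values and uses only the low-order five bits of the shift operand, as the descriptions state. Condition-code recording is not modeled.

```python
MASK32 = 0xFFFFFFFF


def sll(r_a, r_b):
    """Shift left logical: zeros enter at the low-order end."""
    s = r_b & 0x1F
    return (r_a << s) & MASK32


def srl(r_a, r_b):
    """Shift right logical: zeros enter at the high-order end."""
    s = r_b & 0x1F
    return (r_a & MASK32) >> s


def sra(r_a, r_b):
    """Shift right arithmetic: copies of the sign bit enter at the high-order end."""
    s = r_b & 0x1F
    r_a &= MASK32
    signed = r_a - (1 << 32) if r_a & 0x80000000 else r_a
    return (signed >> s) & MASK32
```

A shift count of 36 behaves like a shift count of 4, because only the low-order five bits of rB take part.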
srlix - Shift Right Logical Immediate

Scalar Unit

srli rD, rA, shift_amount (C = 0)

srlic rD, rA, shift_amount (C = 1)


0000 II 

rD 

rA 

shift  amount 

0 

X 

0000 II 

0  56  10  11  15  16  20  21  22  25  26 


s ← shift_amount

The contents of rA are shifted right by shift_amount bits, inserting zeros into the high-order bits of the result. The result is placed into rD.
Other registers altered:

• If C = 1, scalar condition code registers: LT, GT, EQ
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st  -  Store  General-Purpose  Register 


Scalar  Unit 

st  rD,  rA,  offset 


IIOOOI 

rD 

rA 

offset 

0  56  10  II  15  16  31 


EA ← 0xFFFFFFFC ∧ ((rA) + ((offset_0)^{16} || offset))

MEM[EA] ← rD

The 16-bit offset is sign-extended and added to the contents of rA to form the effective address EA. The 32-bit word contents of rD are stored
at the memory location specified by EA (ignoring the least two significant bits to ensure a 32-bit aligned address).

Other  registers  altered: 

•  None 
subx - Subtract


Scalar  Unit 

sub  rD,  rA,  rB  (C  =  0) 

subc  rD,  rA,  rB  (C  =  1) 


0000 II 

rD 

rA 

rB 

0 

X 

lOOOlO 

0  56  10  II  15  16  20  21  22  25  26  31 


rD ← (rA) + ¬(rB) + 1

The contents of rB are subtracted from the contents of rA, and the result is placed into rD.
Other registers altered:

• If C = 1, scalar condition code registers: LT, GT, EQ, CA

• Scalar condition code OV is set if the operation causes overflow.
subex - Subtract Extended


Scalar Unit

sube rD, rA, rB (C = 0)

subec rD, rA, rB (C = 1)


0000 II 

rD 

rA 

rB 

0 

X 

1000 II 

0  56  10  II  15  16  20  21  22  25  26  31 


rD ← (rA) + ¬(rB) + CA

The contents of rB are subtracted from the contents of rA, using the carry bit CA as the carry in, and the result is placed into rD.
Other registers altered:

• If C = 1, scalar condition code registers: LT, GT, EQ, CA

• Scalar condition code OV is set if the operation causes overflow.
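
The subtract and subtract-extended forms can be chained to subtract values wider than 32 bits. The Python sketch below illustrates this under one assumption that the text does not state explicitly: that CA holds the carry out of the adder computing (rA) + ¬(rB) + carry-in. The function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def sub_with_carry(r_a, r_b, carry_in=1):
    """Compute rA + NOT(rB) + carry_in; return (result, carry_out)."""
    total = (r_a & MASK32) + (~r_b & MASK32) + carry_in
    return total & MASK32, total >> 32


def sub64(a_hi, a_lo, b_hi, b_lo):
    """64-bit subtraction built from a subc on the low words and a sube on the high words."""
    lo, ca = sub_with_carry(a_lo, b_lo)       # subc: carry-in fixed at 1, CA recorded
    hi, _ = sub_with_carry(a_hi, b_hi, ca)    # sube: carry-in taken from CA
    return hi, lo
```

Subtracting 1 from 0x1_00000000 this way borrows from the high word and gives 0x0_FFFFFFFF.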
subu - Subtract


Scalar Unit

subu rD, rA, rB


0000 II 

rD 

rA 

rB 

□ 

X 

lOOlOO 

0  56  10  II  15  16  20  21  22  25  26  31 


rD ← (rA) + ¬(rB) + 1

The contents of rB are subtracted from the contents of rA, and the result is placed into rD. This instruction is identical to sub except that the
OV condition code is updated to reflect unsigned arithmetic.

Other registers altered:

• Scalar condition code registers: LT, GT, EQ, CA

• Scalar condition code OV is set if the operation causes overflow.
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sys  -  System  Call 


sys 


000001 


code 


000000 


0 


5  6 


25  26 


31 


A system call is made by setting bit 10 of the ESW (Exception Source Word) register, which in turn triggers an exception. Refer to Chapter
0 of the DIVA PIM Node Architecture manual for details regarding exceptions.

Other  registers  altered: 

•  None 
waddx - WideWord Add


WideWord Unit

waddpw wrD, wrA, wrB (C = 0)
waddcpw wrD, wrA, wrB (C = 1)


000010 

wrD 

wrA 

wrB 

0 

PPWW 

100000 

0  56  10  11  15  16  20  21  22  25  26  31 


Variable values in the following equations are as follows:


WW Value   size
00         8
01         16
10         32

for i = 0 to (256 − size) by size
    if PP bits and conditions are set accordingly
        (wrD)_{i .. i+(size−1)} ← (wrA)_{i .. i+(size−1)} + (wrB)_{i .. i+(size−1)}


The WW field determines if the 256-bit contents of wrA and wrB are treated as 32 bytes, 16 half-words, or 8 words. The aggregate sums of
the aligned data fields of wrA and wrB are placed into wrD, subject to participation.

Other registers altered:

• If C = 1, WideWord condition code registers: LT, GT, EQ, CA

• A WideWord OV condition code bit is set if the operation in its corresponding datapath causes overflow.
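
The lane-wise behavior of the WideWord add can be illustrated in Python by treating each register as a 256-bit integer. The function name is hypothetical; participation (PP) and the condition codes are not modeled, so every lane is updated.

```python
def wadd(wr_a, wr_b, size):
    """Add two 256-bit values lane by lane; size is 8, 16, or 32 bits."""
    lane_mask = (1 << size) - 1
    result = 0
    for shift in range(0, 256, size):
        a = (wr_a >> shift) & lane_mask
        b = (wr_b >> shift) & lane_mask
        # Each lane wraps independently; no carry crosses into the next lane.
        result |= ((a + b) & lane_mask) << shift
    return result
```

Adding 0x00FF and 0x0001 gives 0x0000 when the operands are treated as bytes, but 0x0100 when they are treated as half-words, since the carry is kept only within a lane.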
waddex - WideWord Add Extended


WideWord Unit

waddepw wrD, wrA, wrB (C = 0)
waddecpw wrD, wrA, wrB (C = 1)


Variable values in the following equations are as follows:


for i = 0 to (256 − size) by size

    if PP bits and conditions are set accordingly


        (wrD)_{i .. i+(size−1)} ← (wrA)_{i .. i+(size−1)} + (wrB)_{i .. i+(size−1)} + CA_{i/8}


The WW field determines if the 256-bit contents of wrA and wrB are treated as 32 bytes, 16 half-words, or 8 words. The aggregate sums of
the aligned data fields of wrA and wrB are placed into wrD, subject to participation. Each data field uses the associated bit of the WideWord
Carry register as a carry in for the operation.

Other registers altered:

• If C = 1, WideWord condition code registers: LT, GT, EQ, CA

• A WideWord OV condition code bit is set if the operation in its corresponding datapath causes overflow.


wandx - WideWord AND


WideWord Unit

wandpw wrD, wrA, wrB (C = 0)
wandcpw wrD, wrA, wrB (C = 1)


000010 

wrD 

wrA 

wrB 

Q 

PPWW 

lOIOOO 

0  56  10  II  15  16  20  21  22  25  26  31 


Variable values in the following equations are as follows:


WW Value   size
00         8
01         16
10         32

for i = 0 to (256 − size) by size
    if PP bits and conditions are set accordingly
        (wrD)_{i .. i+(size−1)} ← (wrA)_{i .. i+(size−1)} ∧ (wrB)_{i .. i+(size−1)}

The 256-bit contents of wrA are ANDed with the 256-bit contents of wrB, and the result is placed into wrD, subject to participation. The
WW field simply affects how participation applies and how condition codes are updated for this operation.

Other registers altered:

• If C = 1, WideWord condition code registers: LT, GT, EQ
wfabsx - WideWord Floating-Point Absolute Value

WideWord Unit

wfabsp wrD, wrA (C = 0)
wfabscp wrD, wrA (C = 1)


011101 

wrD 

wrA 

00000 

0 

PPIO 

000101 

0  56  10  II  15  16  20  21  22  25  26 


for  i  =  0  to  224  by  32 

if  PP  bits  and  conditions  are  set  accordingly 

ht/\,  +  31  I* “■'■■■*  Vy  +  5i|  (I'sing  lloating-fHrint  arithmetic) 

The 256-bit contents of wrA are treated as 8 single-precision floating-point operands. The aggregate absolute values of the floating-point
operands of wrA are placed into wrD, subject to participation.

Other registers altered:

•  If C = 1, WideWord condition code registers: LT, GT, EQ

•  FPSR may also be updated if any floating-point exceptions occur.




wfaddx - WideWord Floating-Point Add

WideWord Unit

wfaddpp   wrD, wrA, wrB  (C = 0)
wfaddcpp  wrD, wrA, wrB  (C = 1)

Field:  011101  wrD   wrA    wrB    C   PP10   000000
Bits:   0-5     6-10  11-15  16-20  21  22-25  26-31

for i = 0 to 224 by 32
    if PP bits and conditions are set accordingly
        $(wrD)_{i,\,i+31} \leftarrow (wrA)_{i,\,i+31} + (wrB)_{i,\,i+31}$ (using floating-point arithmetic)

The 256-bit contents of wrA and wrB are treated as 8 single-precision floating-point operands. The aggregate floating-point sums of the
aligned data fields of wrA and wrB are placed into wrD, subject to participation. Floating-point exceptions may be triggered by this
operation.

Other registers altered:

•  If C = 1, WideWord condition code registers: LT, GT, EQ

•  FPSR may also be updated if any floating-point exceptions occur.
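
As a rough illustration of the eight-lane addition, the sketch below unpacks each 256-bit operand into eight single-precision values, adds them lane by lane, and repacks the result. The function name `wfadd_model` and the big-endian byte order (byte 0 holding bits 0-7) are assumptions, and participation, condition codes, and FPSR updates are deliberately not modeled.

```python
import struct

def wfadd_model(wr_a: bytes, wr_b: bytes) -> bytes:
    """Lane-wise single-precision add of two 32-byte WideWord values."""
    lanes_a = struct.unpack(">8f", wr_a)  # eight big-endian 32-bit floats (assumed layout)
    lanes_b = struct.unpack(">8f", wr_b)
    # Python adds in double precision; packing with "f" rounds each sum to single.
    return struct.pack(">8f", *(x + y for x, y in zip(lanes_a, lanes_b)))
```

Note that `struct.pack` raises `OverflowError` for a sum outside the single-precision range, whereas the hardware would report a floating-point exception through FPSR.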
wfdivx - WideWord Floating-Point Divide

WideWord Unit

wfdivpp   wrD, wrA, wrB  (C = 0)
wfdivcpp  wrD, wrA, wrB  (C = 1)

Field:  011101  wrD   wrA    wrB    C   PP10   000111
Bits:   0-5     6-10  11-15  16-20  21  22-25  26-31

for i = 0 to 224 by 32
    if PP bits and conditions are set accordingly
        $(wrD)_{i,\,i+31} \leftarrow (wrA)_{i,\,i+31} \div (wrB)_{i,\,i+31}$ (using floating-point arithmetic)


The 256-bit contents of wrA and wrB are treated as 8 single-precision floating-point operands. The aggregate floating-point quotients of the
aligned data fields of wrA and wrB are placed into wrD, subject to participation. Floating-point exceptions may be triggered by this
operation.

Other registers altered:

•  If C = 1, WideWord condition code registers: LT, GT, EQ

•  FPSR may also be updated if any floating-point exceptions occur.




wfmulx - WideWord Floating-Point Multiply

WideWord Unit

wfmulpp   wrD, wrA, wrB  (C = 0)
wfmulcpp  wrD, wrA, wrB  (C = 1)

Field:  011101  wrD   wrA    wrB    C   PP10   000110
Bits:   0-5     6-10  11-15  16-20  21  22-25  26-31

for i = 0 to 224 by 32
    if PP bits and conditions are set accordingly
        $(wrD)_{i,\,i+31} \leftarrow (wrA)_{i,\,i+31} \times (wrB)_{i,\,i+31}$ (using floating-point arithmetic)


The 256-bit contents of wrA and wrB are treated as 8 single-precision floating-point operands. The aggregate floating-point products of the
aligned data fields of wrA and wrB are placed into wrD, subject to participation. Floating-point exceptions may be triggered by this
operation.

Other registers altered:

•  If C = 1, WideWord condition code registers: LT, GT, EQ

•  FPSR may also be updated if any floating-point exceptions occur.
wfnegx - WideWord Floating-Point Negate

WideWord Unit

wfnegpp   wrD, wrA  (C = 0)
wfnegcpp  wrD, wrA  (C = 1)

Field:  011101  wrD   wrA    00000  C   PP10   000100
Bits:   0-5     6-10  11-15  16-20  21  22-25  26-31

for i = 0 to 224 by 32
    if PP bits and conditions are set accordingly
        $(wrD)_{i,\,i+31} \leftarrow -((wrA)_{i,\,i+31})$ (using floating-point arithmetic)

The 256-bit contents of wrA are treated as 8 single-precision floating-point operands. The aggregate negations of the floating-point operands
of wrA are placed into wrD, subject to participation.

Other registers altered:

•  If C = 1, WideWord condition code registers: LT, GT, EQ

•  FPSR may also be updated if any floating-point exceptions occur.




wfsubx - WideWord Floating-Point Subtract

WideWord Unit

wfsubpp   wrD, wrA, wrB  (C = 0)
wfsubcpp  wrD, wrA, wrB  (C = 1)

Field:  011101  wrD   wrA    wrB    C   PP10   000001
Bits:   0-5     6-10  11-15  16-20  21  22-25  26-31

for i = 0 to 224 by 32
    if PP bits and conditions are set accordingly
        $(wrD)_{i,\,i+31} \leftarrow (wrA)_{i,\,i+31} - (wrB)_{i,\,i+31}$ (using floating-point arithmetic)

The 256-bit contents of wrA and wrB are treated as 8 single-precision floating-point operands. The aggregate floating-point differences of
the aligned data fields of wrA and wrB are placed into wrD, subject to participation. Floating-point exceptions may be triggered by this
operation.

Other registers altered:

•  If C = 1, WideWord condition code registers: LT, GT, EQ

•  FPSR may also be updated if any floating-point exceptions occur.
wftix - WideWord Floating-Point to Integer

WideWord Unit

wftipp   wrD, wrA  (C = 0)
wfticpp  wrD, wrA  (C = 1)

Field:  011101  wrD   wrA    00000  C   PP10   000010
Bits:   0-5     6-10  11-15  16-20  21  22-25  26-31

for i = 0 to 224 by 32
    if PP bits and conditions are set accordingly
        $(wrD)_{i,\,i+31} \leftarrow int((wrA)_{i,\,i+31})$ (assuming floating-point input operand)

The 256-bit contents of wrA are treated as 8 single-precision floating-point operands. Each single-precision floating-point operand is
converted to a 32-bit integer, and the aggregation of these 8 integers is placed into wrD, subject to participation. Floating-point exceptions may
be triggered by this operation.

Other registers altered:

•  If C = 1, WideWord condition code registers: LT, GT, EQ

•  FPSR may also be updated if any floating-point exceptions occur.
witfx - WideWord Integer to Floating-Point

WideWord Unit

witfpp   wrD, wrA  (C = 0)
witfcpp  wrD, wrA  (C = 1)

Field:  011101  wrD   wrA    00000  C   PP10   000011
Bits:   0-5     6-10  11-15  16-20  21  22-25  26-31

for i = 0 to 224 by 32
    if PP bits and conditions are set accordingly

        $(wrD)_{i,\,i+31} \leftarrow float((wrA)_{i,\,i+31})$ (assuming integer input operand)

The 256-bit contents of wrA are treated as eight 32-bit integer operands. Each integer operand is converted to a single-precision
floating-point number, and the aggregation of these 8 single-precision floating-point numbers is placed into wrD, subject to participation.
Floating-point exceptions may be triggered by this operation.

Other registers altered:

•  If C = 1, WideWord condition code registers: LT, GT, EQ

•  FPSR may also be updated if any floating-point exceptions occur.
wld - Load WideWord Register

WideWord Unit

wld  wrD, rA, offset

Field:  110100  wrD   rA     offset
Bits:   0-5     6-10  11-15  16-31

$EA \leftarrow \mathtt{0xFFFFFFE0} \wedge ((rA) + ((offset_0)^{16} \,\|\, offset))$

$(wrD) \leftarrow MEM[EA]$

The 16-bit offset is sign-extended and added to the contents of rA to form the effective address EA. The 256-bit value at the memory location
specified by EA (ignoring the five least significant bits to ensure a 256-bit aligned address) is then loaded into wrD.

Other  registers  altered: 

•  None 
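
The effective-address computation can be sketched as follows. The function names `sign_extend16`, `effective_address`, and `wld_model`, and the use of a flat `bytes` object as memory, are assumptions made for illustration only.

```python
def sign_extend16(offset: int) -> int:
    """Interpret the low 16 bits of offset as a two's-complement value."""
    offset &= 0xFFFF
    return offset - 0x10000 if offset & 0x8000 else offset

def effective_address(ra: int, offset: int) -> int:
    """EA = 0xFFFFFFE0 AND (rA + sign-extended offset): a 32-byte-aligned address."""
    return (ra + sign_extend16(offset)) & 0xFFFFFFE0

def wld_model(memory: bytes, ra: int, offset: int) -> bytes:
    """Return the 256-bit (32-byte) value at the aligned effective address."""
    ea = effective_address(ra, offset)
    return memory[ea:ea + 32]
```

Masking with 0xFFFFFFE0 clears the five low-order address bits, so any address inside a 32-byte block resolves to the start of that block.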
wmrgx - WideWord Merge

WideWord Unit

wmrgccpp   wrD, wrA, wrB  (C = 0)
wmrgcccpp  wrD, wrA, wrB  (C = 1)

Field:  000010  wrD   wrA    wrB    C   PPWW   101111
Bits:   0-5     6-10  11-15  16-20  21  22-25  26-31

Variable values in the following equations are as follows:

WW Value  cc  Mnemonic (cc)
00        EQ  eq
01        LT  lt
10        GT  gt
11        M   m

for j = 0 to 248 by 8
    if PP bits and conditions are set accordingly
        if $cc_{j/8} = 1$
            $(wrD)_{j,\,j+7} \leftarrow (wrA)_{j,\,j+7}$
        else
            $(wrD)_{j,\,j+7} \leftarrow (wrB)_{j,\,j+7}$


Each bit of the WideWord condition code register specified by the WW bits of the instruction serves as a selector. If the bit is 1, the
corresponding byte contents of wrA are placed into the corresponding byte lane of wrD, subject to participation. If the bit is 0, the corresponding
byte contents of wrB are placed into the corresponding byte lane of wrD, subject to participation.

Other registers altered:

•  If C = 1, WideWord condition code registers: LT, GT, EQ
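
The byte-lane selection described above amounts to a 32-way multiplexer, sketched below. The name `wmrg_model` and the representation of the selected condition code register as a sequence of 32 zero/one values (one per byte lane) are assumptions; participation is not modeled.

```python
def wmrg_model(wr_a: bytes, wr_b: bytes, selector_bits) -> bytes:
    """Pick each byte from wrA where the selector bit is 1, otherwise from wrB."""
    if not (len(wr_a) == len(wr_b) == len(selector_bits) == 32):
        raise ValueError("expected 32 byte lanes and 32 selector bits")
    return bytes(a if bit else b for a, b, bit in zip(wr_a, wr_b, selector_bits))
```

A typical use would be to run a compare with C = 1 to fill a condition code register, then merge two candidate vectors according to the per-lane outcome.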
wmules - WideWord Multiply Even Signed

WideWord Unit

wmulesppww  wrD, wrA, wrB

Field:  000010  wrD   wrA    wrB    …   PPWW   100110
Bits:   0-5     6-10  11-15  16-20  21  22-25  26-31

Variable values in the following equations are as follows:

for i = 0 to (256 - 2 x size) by 2 x size
    if PP bits and conditions are set accordingly


Each even-numbered signed-integer byte or half-word of wrA is multiplied by the corresponding signed-integer byte or half-word of wrB,
where the WW field determines if the 256-bit contents of wrA and wrB are treated as bytes or half-words. The resulting signed half-word or
word products are placed, in the same order, into wrD, subject to participation. No condition codes are updated as a result of this operation.

Other  registers  altered: 

•  None 
wmuleu - WideWord Multiply Even Unsigned

WideWord Unit

wmuleuppww  wrD, wrA, wrB

Field:  000010  wrD   wrA    wrB    …   PPWW   100110
Bits:   0-5     6-10  11-15  16-20  21  22-25  26-31

Variable values in the following equations are as follows:

WW Value  size
01        8
10        16

for i = 0 to (256 - 2 x size) by 2 x size
    if PP bits and conditions are set accordingly
        $(wrD)_{i,\,i+(2 \times size-1)} \leftarrow (wrA)_{i,\,i+(size-1)} \times (wrB)_{i,\,i+(size-1)}$


Each even-numbered unsigned-integer byte or half-word of wrA is multiplied by the corresponding unsigned-integer byte or half-word of
wrB, where the WW field determines if the 256-bit contents of wrA and wrB are treated as bytes or half-words. The resulting unsigned
half-word or word products are placed, in the same order, into wrD, subject to participation. No condition codes are updated as a result of this
operation.

Other registers altered:

•  None 
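
A small sketch of the even-element multiply may help: each pair of adjacent source elements occupies one double-width slot, and only the first (even-numbered) element of each pair is multiplied. The name `wmuleu_model` is hypothetical, the register is assumed to be 32 bytes with byte 0 holding bits 0-7, and participation is not modeled.

```python
def wmuleu_model(wr_a: bytes, wr_b: bytes, size: int) -> bytes:
    """Multiply even-numbered unsigned elements; size is the source width in bits (8 or 16)."""
    if size not in (8, 16):
        raise ValueError("size must be 8 or 16")
    n = size // 8                       # source element width in bytes
    result = bytearray(32)
    for start in range(0, 32, 2 * n):   # one double-width result slot per iteration
        a = int.from_bytes(wr_a[start:start + n], "big")
        b = int.from_bytes(wr_b[start:start + n], "big")
        result[start:start + 2 * n] = (a * b).to_bytes(2 * n, "big")
    return bytes(result)
```

Because the product of two n-byte unsigned values always fits in 2n bytes, no saturation or truncation is needed, which is consistent with the instruction not updating any condition codes.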
wmulos - WideWord Multiply Odd Signed

WideWord Unit

wmulosppww  wrD, wrA, wrB

Field:  000010  wrD   wrA    wrB    …   PPWW   …
Bits:   0-5     6-10  11-15  16-20  21  22-25  26-31

Variable values in the following equations are as follows:

for i = 0 to (256 - 2 x size) by 2 x size
    if PP bits and conditions are set accordingly
        $(wrD)_{i,\,i+(2 \times size-1)} \leftarrow (wrA)_{i+size,\,i+(2 \times size-1)} \times (wrB)_{i+size,\,i+(2 \times size-1)}$


Each odd-numbered signed-integer byte or half-word of wrA is multiplied by the corresponding signed-integer byte or half-word of wrB,
where the WW field determines if the 256-bit contents of wrA and wrB are treated as bytes or half-words. The resulting signed half-word or
word products are placed, in the same order, into wrD, subject to participation. No condition codes are updated as a result of this operation.

Other registers altered:

•  None 
wmulou - WideWord Multiply Odd Unsigned

WideWord Unit

wmulouppww  wrD, wrA, wrB

Field:  000010  wrD   wrA    wrB    …   PPWW   100111
Bits:   0-5     6-10  11-15  16-20  21  22-25  26-31

Variable values in the following equations are as follows:

WW Value  size
01        8
10        16

for i = 0 to (256 - 2 x size) by 2 x size
    if PP bits and conditions are set accordingly
        $(wrD)_{i,\,i+(2 \times size-1)} \leftarrow (wrA)_{i+size,\,i+(2 \times size-1)} \times (wrB)_{i+size,\,i+(2 \times size-1)}$


Each odd-numbered unsigned-integer byte or half-word of wrA is multiplied by the corresponding unsigned-integer byte or half-word of
wrB, where the WW field determines if the 256-bit contents of wrA and wrB are treated as bytes or half-words. The resulting unsigned
half-word or word products are placed, in the same order, into wrD, subject to participation. No condition codes are updated as a result of this
operation.

Other  registers  altered: 

•  None 
wnotx - WideWord NOT

WideWord Unit

wnotppww   wrD, wrA  (C = 0)
wnotcppww  wrD, wrA  (C = 1)

Field:  000010  wrD   wrA    00000  C   PPWW   101110
Bits:   0-5     6-10  11-15  16-20  21  22-25  26-31

Variable values in the following equations are as follows:

WW Value  size
00        8
01        16
10        32

for i = 0 to (256 - size) by size
    if PP bits and conditions are set accordingly
        $(wrD)_{i,\,i+(size-1)} \leftarrow \neg (wrA)_{i,\,i+(size-1)}$

The 256-bit contents of wrA are bitwise inverted, and the result is placed into wrD, subject to participation. The WW field simply affects
how participation applies and how condition codes are updated for this operation.

Other registers altered:

•  If C = 1, WideWord condition code registers: LT, GT, EQ
worx - WideWord OR

WideWord Unit

worppww   wrD, wrA, wrB  (C = 0)
worcppww  wrD, wrA, wrB  (C = 1)

Field:  000010  wrD   wrA    wrB    C   PPWW   101100
Bits:   0-5     6-10  11-15  16-20  21  22-25  26-31

Variable values in the following equations are as follows:

WW Value  size
00        8
01        16
10        32

for i = 0 to (256 - size) by size
    if PP bits and conditions are set accordingly
        $(wrD)_{i,\,i+(size-1)} \leftarrow (wrA)_{i,\,i+(size-1)} \vee (wrB)_{i,\,i+(size-1)}$


The 256-bit contents of wrA are ORed with the 256-bit contents of wrB, and the result is placed into wrD, subject to participation. The WW
field simply affects how participation applies and how condition codes are updated for this operation.

Other registers altered:

•  If C = 1, WideWord condition code registers: LT, GT, EQ




wpksx - WideWord Pack Signed

WideWord Unit

wpksww  wrD, wrA, wrB

Field:  000010  wrD   wrA    wrB    0   00WW   001110
Bits:   0-5     6-10  11-15  16-20  21  22-25  26-31

Variable values in the following equations are as follows:

WW Value  size  min    max
01        16    -2^7   2^7 - 1
10        32    -2^15  2^15 - 1

for i = 0 to (128 - (size/2)) by (size/2)
    if $(wrA)_{(i \times 2),\,(i \times 2)+size-1} < min$
        $(wrD)_{i,\,i+(size/2)-1} \leftarrow min$
    else if $(wrA)_{(i \times 2),\,(i \times 2)+size-1} > max$
        $(wrD)_{i,\,i+(size/2)-1} \leftarrow max$
    else
        $(wrD)_{i,\,i+(size/2)-1} \leftarrow (wrA)_{(i \times 2)+(size/2),\,(i \times 2)+size-1}$
    if $(wrB)_{(i \times 2),\,(i \times 2)+size-1} < min$
        $(wrD)_{128+i,\,128+i+(size/2)-1} \leftarrow min$
    else if $(wrB)_{(i \times 2),\,(i \times 2)+size-1} > max$
        $(wrD)_{128+i,\,128+i+(size/2)-1} \leftarrow max$
    else
        $(wrD)_{128+i,\,128+i+(size/2)-1} \leftarrow (wrB)_{(i \times 2)+(size/2),\,(i \times 2)+size-1}$


Let the source vector be the concatenation of the contents of wrA followed by wrB. Each signed integer half-word or word, as specified by
the WW bits, of the source vector is converted to a signed integer byte or half-word, respectively. If the value of the source element is outside
the bounds that can be represented in the width of the result element, the result saturates to the minimum or maximum value appropriately.
The aggregate result is placed into wrD. Note that participation is not supported for this instruction.

Other  registers  altered: 

•  None 
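
The saturating conversion can be illustrated with a short model. The name `wpks_model` is hypothetical, and the register layout (32 bytes, byte 0 holding bits 0-7, elements stored most significant byte first) is an assumption.

```python
def wpks_model(wr_a: bytes, wr_b: bytes, size: int) -> bytes:
    """Pack signed elements of wrA followed by wrB into half-width elements with saturation."""
    if size not in (16, 32):
        raise ValueError("size must be 16 or 32")
    n = size // 8                              # source element width in bytes
    lo = -(1 << (size // 2 - 1))               # minimum of the half-width signed result
    hi = (1 << (size // 2 - 1)) - 1            # maximum of the half-width signed result
    source = wr_a + wr_b                       # 64-byte source vector
    result = bytearray()
    for start in range(0, 64, n):
        value = int.from_bytes(source[start:start + n], "big", signed=True)
        value = max(lo, min(hi, value))        # saturate instead of wrapping
        result += value.to_bytes(n // 2, "big", signed=True)
    return bytes(result)
```

The first half of the result comes from wrA and the second half from wrB, so two full source registers always produce exactly one 256-bit result.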
wpkux - WideWord Pack Unsigned

WideWord Unit

wpkuww  wrD, wrA, wrB

Field:  000010  wrD   wrA    wrB    …   00WW   001110
Bits:   0-5     6-10  11-15  16-20  21  22-25  26-31

Variable values in the following equations are as follows:

WW Value  size  max
01        16    2^8 - 1
10        32    2^16 - 1

for i = 0 to (128 - (size/2)) by (size/2)
    if $(wrA)_{(i \times 2),\,(i \times 2)+size-1} > max$
        $(wrD)_{i,\,i+(size/2)-1} \leftarrow max$
    else
        $(wrD)_{i,\,i+(size/2)-1} \leftarrow (wrA)_{(i \times 2)+(size/2),\,(i \times 2)+size-1}$
    if $(wrB)_{(i \times 2),\,(i \times 2)+size-1} > max$
        $(wrD)_{128+i,\,128+i+(size/2)-1} \leftarrow max$
    else
        $(wrD)_{128+i,\,128+i+(size/2)-1} \leftarrow (wrB)_{(i \times 2)+(size/2),\,(i \times 2)+size-1}$

Let the source vector be the concatenation of the contents of wrA followed by wrB. Each unsigned integer half-word or word, as specified
by the WW bits, of the source vector is converted to an unsigned integer byte or half-word, respectively. If the value of the source element is
greater than the maximum value that can be represented in the width of the result element, the result saturates to the maximum value. The
aggregate result is placed into wrD. Note that participation is not supported for this instruction.

Other  registers  altered: 

•  None 
wprmx - WideWord Permute

WideWord Unit

wprmpp  wrD, wrA, wrB

Field:  000010  wrD   wrA    wrB    0   PP00   001000
Bits:   0-5     6-10  11-15  16-20  21  22-25  26-31

for i = 0 to 248 by 8
    $x \leftarrow (wrB)_{i+3,\,i+7}$
    if PP bits and conditions are set accordingly
        $(wrD)_{i,\,i+7} \leftarrow (wrA)_{x \times 8,\,(x \times 8)+7}$

The contents of wrA are the source vector for this permutation operation. Bits 3 to 7 of each byte element of the contents of wrB are used to
select a byte element from the source vector for each byte element of the result. The result is placed into wrD, subject to participation.

Other  registers  altered: 

•  None 
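
A byte permute of this kind reduces to an indexed gather, as the following sketch shows. The name `wprm_model` is hypothetical; bits 3 to 7 of a byte are taken to be its five low-order bits (bit 0 being the most significant), and participation is not modeled.

```python
def wprm_model(wr_a: bytes, wr_b: bytes) -> bytes:
    """For each result byte, use the low 5 bits of the wrB byte to pick a byte of wrA."""
    if len(wr_a) != 32 or len(wr_b) != 32:
        raise ValueError("WideWord operands must be 32 bytes")
    return bytes(wr_a[selector & 0x1F] for selector in wr_b)
```

For example, a control vector holding 0x1F, 0x1E, ..., 0x00 reverses the byte order of the source, and the three high-order bits of each control byte are ignored.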
wprmix - WideWord Permute Indirect

WideWord Unit

wprmipp  wrD, wrA, rB

Field:  000010  wrD   wrA    rB     0   PP00   001001
Bits:   0-5     6-10  11-15  16-20  21  22-25  26-31


The following lookup table is used for selecting a permutation vector:

index  vector
0x00   0x000102030405060708090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F
0x01   0x0102030405060708090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F00
0x02   0x02030405060708090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F0001
0x03   0x030405060708090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F000102
0x04   0x0405060708090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F00010203
0x05   0x05060708090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F0001020304
0x06   0x060708090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F000102030405
0x07   0x0708090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F00010203040506
0x08   0x08090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F0001020304050607
0x09   0x090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F000102030405060708
0x0A   0x0A0B0C0D0E0F101112131415161718191A1B1C1D1E1F00010203040506070809
0x0B   0x0B0C0D0E0F101112131415161718191A1B1C1D1E1F000102030405060708090A
0x0C   0x0C0D0E0F101112131415161718191A1B1C1D1E1F000102030405060708090A0B
0x0D   0x0D0E0F101112131415161718191A1B1C1D1E1F000102030405060708090A0B0C
0x0E   0x0E0F101112131415161718191A1B1C1D1E1F000102030405060708090A0B0C0D
0x0F   0x0F101112131415161718191A1B1C1D1E1F000102030405060708090A0B0C0D0E
0x10   0x101112131415161718191A1B1C1D1E1F000102030405060708090A0B0C0D0E0F
0x11   0x1112131415161718191A1B1C1D1E1F000102030405060708090A0B0C0D0E0F10
0x12   0x12131415161718191A1B1C1D1E1F000102030405060708090A0B0C0D0E0F1011
0x13   0x131415161718191A1B1C1D1E1F000102030405060708090A0B0C0D0E0F101112
0x14   0x1415161718191A1B1C1D1E1F000102030405060708090A0B0C0D0E0F10111213
0x15   0x15161718191A1B1C1D1E1F000102030405060708090A0B0C0D0E0F1011121314
0x16   0x161718191A1B1C1D1E1F000102030405060708090A0B0C0D0E0F101112131415
0x17   0x1718191A1B1C1D1E1F000102030405060708090A0B0C0D0E0F10111213141516
0x18   0x18191A1B1C1D1E1F000102030405060708090A0B0C0D0E0F1011121314151617
0x19   0x191A1B1C1D1E1F000102030405060708090A0B0C0D0E0F101112131415161718
0x1A   0x1A1B1C1D1E1F000102030405060708090A0B0C0D0E0F10111213141516171819
0x1B   0x1B1C1D1E1F000102030405060708090A0B0C0D0E0F101112131415161718191A
0x1C   0x1C1D1E1F000102030405060708090A0B0C0D0E0F101112131415161718191A1B
0x1D   0x1D1E1F000102030405060708090A0B0C0D0E0F101112131415161718191A1B1C
0x1E   0x1E1F000102030405060708090A0B0C0D0E0F101112131415161718191A1B1C1D
0x1F   0x1F000102030405060708090A0B0C0D0E0F101112131415161718191A1B1C1D1E
0x20   0x00020406080A0C0E10121416181A1C1E01030507090B0D0F11131517191B1D1F
0x21   0x010003020504070609080B0A0D0C0F0E111013121514171619181B1A1D1C1F1E
0x22   0x03020100070605040B0A09080F0E0D0C13121110171615141B1A19181F1E1D1C
0x23   0x07060504030201000F0E0D0C0B0A090817161514131211101F1E1D1C1B1A1918
0x24   0x0F0E0D0C0B0A090807060504030201001F1E1D1C1B1A19181716151413121110
0x25   0x1F1E1D1C1B1A191817161514131211100F0E0D0C0B0A09080706050403020100
0x26   0x0002010304060507080A090B0C0E0D0F1012111314161517181A191B1C1E1D1F
0x27   0x0004010502060307080C090D0A0E0B0F1014111512161317181C191D1A1E1B1F
0x28   0x00080109020A030B040C050D060E070F10181119121A131B141C151D161E171F
0x29   0x0001040508090C0D1011141518191C1D020306070A0B0E0F121316171A1B1E1F
0x2A   0x02030001060704050A0B08090E0F0C0D12131011161714151A1B18191E1F1C1D
0x2B   0x06070405020300010E0F0C0D0A0B080916171415121310111E1F1C1D1A1B1819
0x2C   0x0E0F0C0D0A0B080906070405020300011E1F1C1D1A1B18191617141512131011
0x2D   0x1E1F1C1D1A1B181916171415121310110E0F0C0D0A0B08090607040502030001
0x2E   0x000104050203060708090C0D0A0B0E0F101114151213161718191C1D1A1B1E1F
0x2F   0x0001080902030A0B04050C0D06070E0F1011181912131A1B14151C1D16171E1F
0x30   0x0001020308090A0B1011121318191A1B040506070C0D0E0F141516171C1D1E1F
0x31   0x04050607000102030C0D0E0F08090A0B14151617101112131C1D1E1F18191A1B
0x32   0x0C0D0E0F08090A0B04050607000102031C1D1E1F18191A1B1415161710111213
0x33   0x1C1D1E1F18191A1B14151617101112130C0D0E0F08090A0B0405060700010203
0x34   0x0001020308090A0B040506070C0D0E0F1011121318191A1B141516171C1D1E1F
0x35   0x0001020310111213040506071415161708090A0B18191A1B0C0D0E0F1C1D1E1F
0x36   0x10111213000102031415161704050607 18191A1B08090A0B1C1D1E1F0C0D0E0F
0x37   0x08090A0B0C0D0E0F000102030405060718191A1B1C1D1E1F1011121314151617


$index \leftarrow (rB)_{\ldots,\,31}$

$permvector \leftarrow vector[index]$
for i = 0 to 248 by 8
    $x \leftarrow permvector_{i+3,\,i+7}$
    if PP bits and conditions are set accordingly
        $(wrD)_{i,\,i+7} \leftarrow (wrA)_{x \times 8,\,(x \times 8)+7}$


The contents of wrA are the source vector for this permutation operation. The permutation vector is selected from a lookup table using the
least significant bits of the contents of rB as an index into the table. Bits 3 to 7 of each byte element of the permutation vector are used to
select a byte element from the source vector for each byte element of the result. The result is placed into wrD, subject to participation.

Other  registers  altered: 


•  None 
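
In the lookup table above, entries 0x00 through 0x1F appear to be byte rotations of the identity vector by the index value. Under that reading, those 32 entries can be generated rather than stored, as the sketch below shows. The names `rotation_vector` and `permute` are hypothetical, and entries 0x20 and above follow other patterns that this sketch does not cover.

```python
def rotation_vector(index: int) -> bytes:
    """Permutation vector for table indices 0x00-0x1F: rotate left by `index` bytes."""
    if not 0 <= index <= 0x1F:
        raise ValueError("only indices 0x00-0x1F are rotations")
    return bytes((index + k) % 32 for k in range(32))

def permute(source: bytes, permvector: bytes) -> bytes:
    """Select source bytes using the low 5 bits of each permutation-vector byte."""
    return bytes(source[selector & 0x1F] for selector in permvector)
```

Applying `permute(src, rotation_vector(3))` moves byte 3 of the source into byte 0 of the result, which is useful for realigning data that starts at an arbitrary byte offset.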
wsllx - WideWord Shift Left Logical

WideWord Unit

wsllppww   wrD, wrA, wrB  (C = 0)
wsllcppww  wrD, wrA, wrB  (C = 1)


Variable values in the following equations are as follows:


The WW field determines if the 256-bit contents of wrA and wrB are treated as 32 bytes, 16 half-words, or 8 words. The contents of each
data field of wrA are shifted left by the number of bits specified by the low order bits of the corresponding data field contained as contents
of wrB, inserting zeros into the low order bits of each data field of the result. The result is placed into wrD, subject to participation.

Other registers altered:


•  If C = 1, WideWord condition code registers: LT, GT, EQ


wsllix - WideWord Shift Left Logical Immediate

WideWord Unit

wsllippww   wrD, wrA, shift_amount  (C = 0)
wsllicppww  wrD, wrA, shift_amount  (C = 1)


Variable values in the following equations are as follows:


$s \leftarrow shift\_amount_{\ldots,\,4}$
for i = 0 to (256 - size) by size
    if PP bits and conditions are set accordingly
        $(wrD)_{i,\,i+(size-1)} \leftarrow (wrA)_{i+s,\,i+(size-1)} \,\|\, (0)^{s}$


The WW field determines if the 256-bit contents of wrA are treated as 32 bytes, 16 half-words, or 8 words. The contents of each data field
of wrA are shifted left by the number of bits specified by the appropriate bits of the shift amount, inserting zeros into the low order bits of
each data field of the result. The result is placed into wrD, subject to participation.

Other registers altered:


•  If C = 1, WideWord condition code registers: LT, GT, EQ


wsrax - WideWord Shift Right Arithmetic

WideWord Unit

wsrappww   wrD, wrA, wrB  (C = 0)
wsracppww  wrD, wrA, wrB  (C = 1)

Field:  000010  wrD   wrA    wrB    C   PPWW   000101
Bits:   0-5     6-10  11-15  16-20  21  22-25  26-31

Variable values in the following equations are as follows:

WW Value  size  bits
00        8     3
01        16    4
10        32    5

for i = 0 to (256 - size) by size
    $s \leftarrow (wrB)_{i+size-bits,\,i+(size-1)}$
    if PP bits and conditions are set accordingly
        $(wrD)_{i,\,i+(size-1)} \leftarrow ((wrA)_{i})^{s} \,\|\, (wrA)_{i,\,i+size-s-1}$


The WW field determines if the 256-bit contents of wrA and wrB are treated as 32 bytes, 16 half-words, or 8 words. The contents of each
data field of wrA are shifted right by the number of bits specified by the low order bits of the corresponding data field contained as contents
of wrB, sign-extending the high-order bits of each data field of the result. The result is placed into wrD, subject to participation.

Other registers altered:


•  If C = 1, WideWord condition code registers: LT, GT, EQ
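
The per-field arithmetic shift can be sketched as below. The name `wsra_model` is hypothetical, the register is assumed to be 32 bytes with byte 0 holding bits 0-7, and participation and condition codes are not modeled. The shift count is taken from the low 3, 4, or 5 bits of the matching wrB field, following the "bits" column of the table for this instruction.

```python
def wsra_model(wr_a: bytes, wr_b: bytes, size: int) -> bytes:
    """Arithmetic right shift of each size-bit field of wrA by the matching wrB field."""
    if size not in (8, 16, 32):
        raise ValueError("size must be 8, 16, or 32")
    n = size // 8
    result = bytearray()
    for start in range(0, 32, n):
        value = int.from_bytes(wr_a[start:start + n], "big", signed=True)
        # size - 1 masks the low 3, 4, or 5 bits for 8-, 16-, or 32-bit fields.
        shift = int.from_bytes(wr_b[start:start + n], "big") & (size - 1)
        # Python's >> on a negative int replicates the sign bit.
        result += (value >> shift).to_bytes(n, "big", signed=True)
    return bytes(result)
```

Replacing the signed interpretation of `value` with an unsigned one would turn this into the logical right shift, since zeros rather than copies of the sign bit would then enter from the left.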
wsraix - WideWord Shift Right Arithmetic Immediate

WideWord Unit

wsraippww   wrD, wrA, shift_amount  (C = 0)
wsraicppww  wrD, wrA, shift_amount  (C = 1)


Variable values in the following equations are as follows:


$s \leftarrow shift\_amount_{\ldots,\,4}$
for i = 0 to (256 - size) by size
    if PP bits and conditions are set accordingly
        $(wrD)_{i,\,i+(size-1)} \leftarrow ((wrA)_{i})^{s} \,\|\, (wrA)_{i,\,i+size-s-1}$


The WW field determines if the 256-bit contents of wrA are treated as 32 bytes, 16 half-words, or 8 words. The contents of each data field
of wrA are shifted right by the number of bits specified by the appropriate bits of the shift amount, sign-extending the high-order bits of each
data field of the result. The result is placed into wrD, subject to participation.

Other registers altered:


•  If C = 1, WideWord condition code registers: LT, GT, EQ
wsrlx - WideWord Shift Right Logical

WideWord Unit

wsrlppww   wrD, wrA, wrB  (C = 0)
wsrlcppww  wrD, wrA, wrB  (C = 1)

Field:  000010  wrD   wrA    wrB    C   PPWW   000001
Bits:   0-5     6-10  11-15  16-20  21  22-25  26-31

Variable values in the following equations are as follows:

WW Value  size  bits
00        8     3
01        16    4
10        32    5

for i = 0 to (256 - size) by size
    $s \leftarrow (wrB)_{i+size-bits,\,i+(size-1)}$
    if PP bits and conditions are set accordingly
        $(wrD)_{i,\,i+(size-1)} \leftarrow (0)^{s} \,\|\, (wrA)_{i,\,i+size-s-1}$

The WW field determines if the 256-bit contents of wrA and wrB are treated as 32 bytes, 16 half-words, or 8 words. The contents of each
data field of wrA are shifted right by the number of bits specified by the low order bits of the corresponding data field contained as contents
of wrB, inserting zeros into the high-order bits of each data field of the result. The result is placed into wrD, subject to participation.

Other registers altered:

•  If C = 1, WideWord condition code registers: LT, GT, EQ
wsrlix - WideWord Shift Right Logical Immediate

WideWord Unit

wsrlippww   wrD, wrA, shift_amount  (C = 0)
wsrlicppww  wrD, wrA, shift_amount  (C = 1)

Field:  000010  wrD   wrA    shift_amount  C   PPWW   000011
Bits:   0-5     6-10  11-15  16-20         21  22-25  26-31

Variable values in the following equations are as follows:

WW Value  size  bits
00        8     3
01        16    4
10        32    5

$s \leftarrow shift\_amount_{\ldots,\,4}$
for i = 0 to (256 - size) by size
    if PP bits and conditions are set accordingly
        $(wrD)_{i,\,i+(size-1)} \leftarrow (0)^{s} \,\|\, (wrA)_{i,\,i+size-s-1}$

The WW field determines if the 256-bit contents of wrA are treated as 32 bytes, 16 half-words, or 8 words. The contents of each data field
of wrA are shifted right by the number of bits specified by the appropriate bits of the shift amount, inserting zeros into the high-order bits of
each data field of the result. The result is placed into wrD, subject to participation.

Other registers altered:

•  If C = 1, WideWord condition code registers: LT, GT, EQ
wst - Store WideWord Register

WideWord Unit

wst  wrD, rA, offset

Field:  110101  wrD   rA     offset
Bits:   0-5     6-10  11-15  16-31

$EA \leftarrow \mathtt{0xFFFFFFE0} \wedge ((rA) + ((offset_0)^{16} \,\|\, offset))$

$MEM[EA] \leftarrow (wrD)$

The 16-bit offset is sign-extended and added to the contents of rA to form the effective address EA. The 256-bit contents of wrD are stored
at the memory location specified by EA (ignoring the five least significant bits to ensure a 256-bit aligned address).

Other  registers  altered: 

•  None 
wsubx - WideWord Subtract

WideWord Unit

wsubppww   wrD, wrA, wrB  (C = 0)
wsubcppww  wrD, wrA, wrB  (C = 1)

Field:  000010  wrD   wrA    wrB    C   PPWW   100010
Bits:   0-5     6-10  11-15  16-20  21  22-25  26-31

Variable values in the following equations are as follows:

WW Value  size
00        8
01        16
10        32

for i = 0 to (256 - size) by size
    if PP bits and conditions are set accordingly
        $(wrD)_{i,\,i+(size-1)} \leftarrow (wrA)_{i,\,i+(size-1)} + \neg (wrB)_{i,\,i+(size-1)} + 1$

The WW field determines if the 256-bit contents of wrA and wrB are treated as 32 bytes, 16 half-words, or 8 words. The aggregate
differences of the aligned data fields of wrA and wrB are placed into wrD, subject to participation.

Other registers altered:

•  If C = 1, WideWord condition code registers: LT, GT, EQ, CA

•  A WideWord OV condition code bit is set if the operation in its corresponding datapath causes overflow.
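
The field-wise behaviour can be illustrated with a short Python sketch. It is a simplification under stated assumptions: every field participates (the PP bits are ignored), no condition codes are modelled, the difference is taken as wrA minus wrB, and a 256-bit register is represented as a Python integer. The function name is hypothetical.

```python
def wsub_sketch(wra: int, wrb: int, size: int) -> int:
    """Subtract wrb from wra independently in each size-bit field of a 256-bit word."""
    assert size in (8, 16, 32)
    mask = (1 << size) - 1
    result = 0
    for shift in range(0, 256, size):
        a = (wra >> shift) & mask
        b = (wrb >> shift) & mask
        # Two's-complement subtraction inside the field: a + NOT(b) + 1.
        diff = (a + (~b & mask) + 1) & mask
        result |= diff << shift
    return result


if __name__ == "__main__":
    print(hex(wsub_sketch(0x0105, 0x0206, 8)))   # fields treated as bytes
    print(hex(wsub_sketch(0x0105, 0x0206, 16)))  # fields treated as half-words
```

The same operands give different results for different field sizes because a borrow never crosses a field boundary.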
wsubex - WideWord Subtract Extended


WideWord  Unit 

wsubepw  wrD,  wrA,  wrB  (C  =  0)
wsubecpw  wrD,  wrA,  wrB  (C  =  1)


Bits:   0-5     6-10  11-15  16-20  21  22-25  26-31
Field:  000010  wrD   wrA    wrB    C   PPWW   100011


Variable values in the following equations are as follows:

WW Value   size
00         8
01         16
10         32

for i = 0 to (256 - size) by size
    if PP bits and conditions are set accordingly

The WW field determines if the 256-bit contents of wrA and wrB are treated as 32 bytes, 16 half-words, or 8 words. The aggregate
differences of the aligned data fields of wrA and wrB are placed into wrD, subject to participation. Each data field uses the associated bit of the
WideWord Carry register as a carry in for the operation.

Other registers altered:

•  If C = 1, WideWord condition code registers: LT, GT, EQ, CA

•  A WideWord OV condition code bit is set if the operation in its corresponding datapath causes overflow.
wsubu - WideWord Subtract Unsigned


WideWord  Unit 

wsubupw  wrD,  wrA,  wrB


Bits:   0-5     6-10  11-15  16-20  21  22-25  26-31
Field:  000010  wrD   wrA    wrB    0   PPWW   100100


Variable values in the following equations are as follows:

WW Value   size
00         8
01         16
10         32

for i = 0 to (256 - size) by size
    if PP bits and conditions are set accordingly

The WW field determines if the 256-bit contents of wrA and wrB are treated as 32 bytes, 16 half-words, or 8 words. The aggregate
differences of the aligned data fields of wrA and wrB are placed into wrD, subject to participation. This instruction is identical to wsub except that
the OV condition codes are updated to reflect unsigned arithmetic.

Other registers altered:

•  WideWord condition code registers: LT, GT, EQ, CA

•  A WideWord OV condition code bit is set if the operation in its corresponding datapath causes overflow.
wupkhx - WideWord Unpack High

WideWord  Unit 

wupkhsw  wrD,  wrA  (C  =  0)
wupkhuw  wrD,  wrA  (C  =  1)


Bits:   0-5     6-10  11-15  16-20  21  22-25  26-31
Field:  000010  wrD   wrA    00000  C   00WW   001101


Variable values in the following equations are as follows:

WW Value   size
00         8
01         16

for i = 0 to (256 - (2 x size)) by (2 x size)
    if C = 1
        wrD[i : i+(2 x size)-1] <- 0^size || (wrA)[i/2 : (i/2)+size-1]
    else
        wrD[i : i+(2 x size)-1] <- ((wrA)[i/2])^size || (wrA)[i/2 : (i/2)+size-1]

The most significant 128 bits of the contents of wrA are unpacked, or type promoted. For example, if WW=00 the 128-bit source vector is
treated as 16 bytes, where each byte is promoted to a 16-bit half-word to form a 256-bit result that is placed into wrD. The C bit indicates
whether sign extension or zero fill is used in the unpacking. Note that participation is not supported for this instruction.

Other  registers  altered: 

•  None 
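
The type promotion can be illustrated with a small Python sketch. It assumes a 256-bit register is held in a Python integer whose upper 128 bits are the "most significant 128 bits" of the text; the function name and the zero_fill flag (standing in for C = 1) are hypothetical.

```python
def wupkh_sketch(wra: int, size: int, zero_fill: bool) -> int:
    """Promote each size-bit field in the upper 128 bits of wra to 2*size bits."""
    assert size in (8, 16)
    mask = (1 << size) - 1
    high_half = (wra >> 128) & ((1 << 128) - 1)
    result = 0
    for k in range(128 // size):
        field = (high_half >> (k * size)) & mask
        if not zero_fill and (field >> (size - 1)) & 1:
            # Sign extension: replicate the field's top bit into the new upper half.
            field |= mask << size
        result |= field << (k * 2 * size)
    return result


if __name__ == "__main__":
    source = 0x807F << 128  # two bytes, 0x80 and 0x7F, in the upper half
    print(hex(wupkh_sketch(source, 8, zero_fill=True)))
    print(hex(wupkh_sketch(source, 8, zero_fill=False)))
```

With zero fill the byte 0x80 becomes the half-word 0x0080; with sign extension it becomes 0xFF80, while 0x7F becomes 0x007F in both cases.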
wupklx - WideWord Unpack Low

WideWord  Unit 

wupklsw  wrD,  wrA  (C  =  0)
wupkluw  wrD,  wrA  (C  =  1)


Bits:   0-5     6-10  11-15  16-20  21  22-25  26-31
Field:  000010  wrD   wrA    00000  C   00WW   001100


Variable values in the following equations are as follows:

WW Value   size
00         8
01         16

for i = 0 to (256 - (2 x size)) by (2 x size)
    if C = 1
        wrD[i : i+(2 x size)-1] <- 0^size || (wrA)[128+(i/2) : 128+(i/2)+size-1]
    else
        wrD[i : i+(2 x size)-1] <- ((wrA)[128+(i/2)])^size || (wrA)[128+(i/2) : 128+(i/2)+size-1]

The least significant 128 bits of the contents of wrA are unpacked, or type promoted. For example, if WW=00 the 128-bit source vector is
treated as 16 bytes, where each byte is promoted to a 16-bit half-word to form a 256-bit result that is placed into wrD. The C bit indicates
whether sign extension or zero fill is used in the unpacking. Note that participation is not supported for this instruction.

Other  registers  altered: 

•  None 
wxorx - WideWord Exclusive-OR


WideWord  Unit 

wxorpw  wrD,  wrA,  wrB  (C  =  0)
wxorcpw  wrD,  wrA,  wrB  (C  =  1)


Bits:   0-5     6-10  11-15  16-20  21  22-25  26-31
Field:  000010  wrD   wrA    wrB    C   PPWW   101010


Variable values in the following equations are as follows:

WW Value   size
00         8
01         16
10         32

for i = 0 to (256 - size) by size
    if PP bits and conditions are set accordingly
        wrD[i : i+size-1] <- (wrA)[i : i+size-1] ⊕ (wrB)[i : i+size-1]

The 256-bit contents of wrA are exclusive-ORed with the 256-bit contents of wrB, and the result is placed into wrD, subject to participation.
The WW field simply affects how participation applies and how condition codes are updated for this operation.

Other registers altered:

•  If C = 1, WideWord condition code registers: LT, GT, EQ




xorx - Exclusive OR


Scalar  Unit 

xor  rD,  rA,  rB  (C  =  0) 

xorc  rD,  rA,  rB  (C  =  1) 


Bits:   0-5     6-10  11-15  16-20  21  22-25  26-31
Field:  000011  rD    rA     rB     C   X      101010


rD <- (rA) ⊕ (rB)

The contents of rA are exclusive-ORed with rB, and the result is placed into rD.

Other registers altered:

•  If C = 1, scalar condition code registers: LT, GT, EQ
xori - Exclusive OR Immediate


Scalar  Unit 

xori  rD,  rA,  IMM 


Bits:   0-5     6-10  11-15  16-31
Field:  101010  rD    rA     IMM


The contents of rA are exclusive-ORed with IMM (prepended with zeros to form a 32-bit value), and the result is placed into rD.

Other registers altered:

•  None 
xoric - Exclusive OR Immediate Recording Condition Codes

Scalar  Unit 

xoric  rD,  rA,  IMM


Bits:   0-5     6-10  11-15  16-31
Field:  101011  rD    rA     IMM


rD <- (rA) ⊕ (0^16 || IMM)

The contents of rA are exclusive-ORed with IMM (prepended with zeros to form a 32-bit value), and the result is placed into rD.

Other registers altered:

•  Scalar condition code registers: LT, GT, EQ
Motivation


DIVA System
Architecture


Chapter  1  -  Introduction  and  Rationale 


The increasing gap between processor and memory speeds is a well-known problem in computer architecture, with peak processor
performance increasing at a rate of 50-60% per year while memory access times improve at merely 5-7%. Recent VLSI technology trends offer a
promising solution to bridging the processor-memory gap: embedded-DRAM technology integrates logic with high-density memory in a
processing-in-memory (PIM) chip. Because PIM internal processors can be directly connected to the memory banks, the memory bandwidth is
dramatically increased (with hundreds of gigabits/second aggregate bandwidth available on a chip, up to 2 orders of magnitude over
conventional DRAM). Latency to on-chip logic is also reduced, down to as little as one half that of a conventional memory system, because internal
memory accesses avoid the delays associated with communicating off chip.

The Data Intensive Architecture (DIVA) Project is an exploration of the potential benefits of making direct use of the high data bandwidth
and low access latency available on memory devices. DIVA leverages embedded-DRAM technology to replace or augment the memory
system of a conventional workstation with "smart memories" capable of very large amounts of processing. System bandwidth limitations are
thus overcome in three ways: (1) tight coupling of a single PIM processor with an on-chip memory bank; (2) distributing multiple processors
and memory banks per PIM chip; and (3) utilizing a separate chip-to-chip interconnect, for direct communication between nodes on
different chips, that bypasses the host system bus.

The DIVA system architecture is focused on achieving the following four goals: (1) developing PIMs that can serve as the only memory in
the system, assuming the dual roles of "smart memories" and conventional memory; (2) supporting a wide range of familiar programming
paradigms, closely related to parallel computing; (3) targeting applications that are severely impacted by the processor-memory bottlenecks
in conventional systems: sparse-matrix and pointer-based applications with irregular memory access patterns, and image and video
applications with large working sets; and (4) developing a VLSI device to exploit memory and communications bandwidth in PIM-based systems
while making efficient use of on-chip resources for target applications.

In DIVA, the PIM chips serve as the memory to a conventional host processor. A DIVA system comprises multiple interconnected PIM
chips (on the order of 32 to 64). On each of these PIM VLSI devices, there may be multiple processors and memory banks. Each PIM
processor has a specific memory bank associated with it. We refer to a single processor and its associated memory bank as a node.

This document describes the architecture of a single node in the DIVA system, presenting its key components in detail. This chapter provides
a framework for understanding the role of a DIVA node by first describing the overall system architecture, as well as the architecture of the
PIM VLSI device, followed by key features of the DIVA system architecture. Subsequently, it describes individual components of the node
architecture, to be covered in much more detail in later chapters.

A driving principle of the DIVA system architecture is efficient use of PIM technology while requiring a smooth migration path for software.
This principle demands integration of PIM features into conventional systems as seamlessly as possible. As a result, DIVA chips are
designed to resemble commercial DRAMs, enabling PIM memory to be accessed by host software as if it were conventional memory. In
Figure 1, we show a small set of PIMs connected to a single host processor through conventional memory control logic. Because of on-chip
memory accesses, this memory controller cannot be a commercially available device: while standard DRAMs are "slave" memories
managed by the host, active PIMs may have to signal a "not ready" condition while access to the memory array is arbitrated.

Parcels, which spawn computation, gather results, synchronize activity, or simply access non-local data, are transmitted through a separate
PIM-to-PIM interconnect to enable communication without interfering with host-memory traffic. This interconnect must have low latency
and high bandwidth and be amenable to the dense packing requirement of memory devices. Furthermore, it must be scalable to allow the
addition or removal of devices from the system. For system sizes of the scale expected for DIVA (32 to 64 PIM chips), this combination of
requirements favors a one-dimensional network. The interconnection network of an earlier embedded scalable system, the Package-Driven
Scalable System (PDSS) [Steele97], is used as a model. The interconnect is implemented by PIM Routing Components (PiRCs), one per
PIM chip. The PiRC inter-chip fabric is then a point-to-point bidirectional ring using wormhole routing and the Red Rover routing algorithm
[Draper96] to effect deadlock-free, low-latency routing of fixed-sized packets. Future generations of DIVA systems will contain large
numbers of PIM chips and will require a more complex network scheme.


Figure 1: DIVA System Physical Organization


DIVA PIM Chip
Architecture


Each DIVA PIM chip is a VLSI memory device augmented with general and special-purpose computing and networking/communication
hardware. A PIM may consist of multiple nodes, each of which is primarily composed of a few megabytes of memory and a node
processor. Figure 2 shows a PIM with four nodes. The nodes on a chip share a single PiRC and a host interface. The PiRC is responsible for routing
parcels on and off chip. The host interface supports conventional memory accesses from the host as well as parcels initiated by the host.

Figure 2 also shows two global interconnects that span the PIM chip for information flow between the nodes, the host interface, and the
PiRC. Each interconnect is distinguished by the type of information it carries. The PIM memory bus is used for conventional memory
accesses from the host processor. The parcel interconnect allows parcels to transit between the host interface, the nodes, and the PiRC.
Within the host interface, a parcel buffer (PBUF) provides a buffer that is memory-mapped into the host processor's address space,
permitting application-level communication through parcels. Each PIM node also has a PBUF, memory-mapped into the node's local address space
(see discussion in next section). Although the PiRC also contains parcel ports, we do not label them PBUFs, as they are not
memory-mapped.
Overview of DIVA

Figure 2: DIVA PIM Chip Organization


The DIVA PIM node processor supports single-issue, in-order execution, with 32-bit instructions and 32-bit addresses. There are two
execution units, or datapaths: a scalar datapath performs sequential operations on 32-bit registers, and a wide datapath performs fine-grain
parallel operations on 256-bit registers. Both scalar and wide datapaths execute from a single instruction stream under the control of a single
5-stage pipeline. The instruction set has been designed so both datapaths can, for the most part, use the same opcodes and condition codes,
generating a large functional overlap. Each datapath has its own independent register file, but special instructions permit direct transfers
between register files without going through memory.

The combination of the execution control pipeline and scalar datapath may be viewed as a conventional microprocessor and may be
programmed as such. This capability is essential to the evolutionary software development approach. Users may, with very little effort, exploit
the coarse-grain parallelism offered by the PIM nodes by simply programming multiple nodes in a conventional sense. However, users may
also exploit fine-grain parallelism by using the WideWord datapath.

Although not supported in the initial DIVA prototype, floating-point functionality will be provided in future systems as extensions to the
WideWord unit to operate on eight 32-bit datapaths. The floating-point support will be mentioned throughout this document, but as it is
subject to change, will not be presented in significant detail.

In addition to the execution units, each DIVA PIM node includes three other units. A memory unit (MU) is responsible for generating proper
control signals to the memory macro. Its functions include initiating refresh cycles as needed and arbitrating between the host memory port
and the execution control unit for access to the memory macro; priority of accesses goes to the host. Furthermore, it tracks and maintains an
open row in the DRAM macro to enable page-mode accesses as often as possible. A small instruction cache (IC) is used to keep instruction
accesses to the memory macro from interfering with data accesses as much as possible. Each node contains a memory-mapped location
called a parcel buffer (PBUF) that serves as a port between the parcel interconnect and the node, permitting efficient application-level parcel
sends and receives.
Key Features of
DIVA Node


Block Diagram and
Description of Node
Components


This section briefly highlights the most important features of the DIVA node architecture:

•  High bandwidth and low latency access to node memory

The wide datapath permits memory accesses of 256 bits with a single load or store operation. Further, the latency on these
memory accesses is quite low. Two consecutive accesses to memory within the same 2k-bit row in the memory cell will be
in page mode, with a latency on the order of just a few node cycles. If not on the same row in memory, accesses are in
random mode, which is perhaps 3 times slower than a page mode access, but is still roughly 3-4 times faster than the latency
that would be observed in a conventional system.

•  Standard scalar instructions augmented with wide ALU and memory operations

In  addition  to  the  high-bandwidth  memory  operations  described  in  the  previous  paragraph,  the  wide  datapath  enables 
superword-level  parallelism  as  available  in  multimedia  extensions  such  as  MMX  and  AltiVec  on  wide  words  of  256  bits. 

The functionality of the wide datapath is distinguished from other multimedia and subword parallelism ISAs in the
following ways, which will be discussed in more detail later. DIVA supports selective execution of instructions on sub-fields
within a WideWord, depending on the state of local and neighboring condition codes; it supports direct transfers to/from
other register files; and the wide datapath is tightly coupled with the inter-chip communication.

•  Integrated  scalar  and  wide  datapaths,  using  a  single  control  pipeline 

The scalar and wide datapaths share a single control pipeline, avoiding complications in keeping two separate pipelines
synchronized. Direct transfers to/from the register files associated with each datapath facilitate efficient switching
between scalar and fine-grain parallel portions of the computation. The two instruction sets share most of the same
opcodes and, to further unify the instruction sets, use the same condition codes.
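
The page-mode condition described under the first feature above can be sketched as follows. This is only an illustration under assumptions not stated in the text: addresses are byte addresses, rows are aligned on row-size boundaries (a 2k-bit row is 256 bytes), and the most recently accessed row is the one kept open. The names are hypothetical.

```python
ROW_BITS = 2048              # 2k-bit row, as stated in the text
ROW_BYTES = ROW_BITS // 8    # 256 bytes per row


def access_mode(prev_addr, addr):
    """Return 'page' if addr falls in the same row as the previous access, else 'random'."""
    if prev_addr is not None and prev_addr // ROW_BYTES == addr // ROW_BYTES:
        return "page"
    return "random"


if __name__ == "__main__":
    previous = None
    for address in (0x100, 0x1E0, 0x200, 0x220):
        print(hex(address), access_mode(previous, address))
        previous = address
```

Sequential 256-bit (32-byte) accesses touch eight locations per row under these assumptions, so a streaming loop would see one random-mode access followed by seven page-mode accesses per row.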

Figure 3 shows the major control and data connections within a node. Information flows into and out of the node via the pbuf or the memory
port. As shown in the figure, arbitration between external memory accesses by the host and node memory accesses is required. This
arbitration adds an insignificant delay to the host memory access time when the PIM processor is not accessing memory; there is little difference in
performance when the PIMs are simply used as conventional memory. If the PIM processor is accessing memory, the host memory access
time includes the additional latency of waiting for the PIM memory cycle to complete.
Scalar Datapath and
Execution Pipeline


Wide Datapath


Figure 3: DIVA PIM Node Architecture

The execution pipeline is shared between the scalar and wide datapaths. It is a standard 5-stage pipeline, with the following stages: (1)
instruction fetch; (2) register decode; (3) execute; (4) memory; and (5) write. There are three classes of pipeline hazards, which result in idle
cycles: (1) long instruction sequences, such as for multiplies and divides; (2) register operations, involving data dependences between nearby
instructions; and (3) memory operations, which stall the pipeline due to multiple cycles of latency to memory. The second class of hazards is
sometimes avoided with pipeline forwarding. Other hazards can only be avoided through careful ordering of instructions by the compiler.

The scalar datapath is for the most part a standard RISC architecture, augmented with a few DIVA-specific functions for coordinating with
the wide datapath. The wide datapath accesses the scalar registers for addressing operations, as well as for controlling subfield operations.

The WideWord datapath processes objects aggregated within a row of the local memory array by operating on 256 bits in a single processor
cycle. This fine-grain parallelism offers additional opportunity for exploiting the increased processor-memory bandwidth available in a PIM.
The WideWord unit can perform bit-level operations, such as simple pattern matching, or higher-order computations such as searches and
reduction operations.

The WideWord datapath has several features to distinguish it from a conventional SIMD architecture. First is the ability to change ALU
operand width on a per-instruction basis, enabling it to treat a WideWord as a packed array of objects of eight, sixteen, or thirty-two bits in
size. This characteristic means the WideWord ALU is more accurately represented as parallel ALUs, where the number of ALUs depends on
the operand size. Second, a permutation network enables applications to rapidly align and reorganize wide register operands. Third, it
supports selective execution of instructions on sub-fields within a WideWord, depending on the state of local and neighboring condition codes.
Memory Unit

Instruction Cache

Parcel Buffer


Other Node Features
Address Translation

Although similar designs support some type of conditional operation, the DIVA WideWord Unit provides a much richer functionality
through the ability to specify selective execution in almost every wide instruction and the use of global condition code information in
selection decisions. Fourth, even for applications where the WideWord ALU operations are not applicable, the wide datapath can be used to
accelerate memory access time and communication.

The memory unit consists of the DRAM macro as well as the memory controller to arbitrate between various kinds of access requests to the
memory. The memory macro includes the features typical of a standard DRAM; in addition, it supports a full address bus, rather than a
row-column multiplexed one. The memory controller arbitrates memory requests, which come from several sources: memory refresh, the host
interface, the memory stage of the node pipeline, and the node instruction cache. These sources are listed in the order in which they are
granted priority.
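
A fixed-priority arbiter of the kind described above can be sketched in a few lines of Python. The source names and the function are hypothetical labels chosen for this example; only the ordering (refresh, host interface, pipeline memory stage, instruction cache) comes from the text.

```python
# Request sources, highest priority first, in the order given in the text.
PRIORITY = ("refresh", "host", "pipeline_mem", "icache")


def arbitrate(pending):
    """Return the highest-priority source with a pending request, or None if idle."""
    for source in PRIORITY:
        if source in pending:
            return source
    return None


if __name__ == "__main__":
    print(arbitrate({"icache", "pipeline_mem"}))
    print(arbitrate({"icache", "host", "pipeline_mem"}))
    print(arbitrate(set()))
```

Because the instruction cache is served last, an instruction fetch that misses waits behind any pending data access from the pipeline, which is one reason the text keeps instruction traffic away from the memory macro where it can.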

A small instruction cache is included to avoid instruction accesses interfering with data requests, both to reduce the frequency of requests to
memory and to maximize the opportunity for faster page mode accesses for the data requests. The instruction cache is direct mapped, and the
size for the initial implementation is 4 Kbytes with 32-byte cache lines. Because it caches just instructions, which are not expected to be
modified during program execution, there is no write-back facility or other mechanism for keeping cache lines coherent with memory. To
support context switching, an invalidate instruction permits invalidation of individual cache lines.
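
For a direct-mapped cache with the stated geometry, an address splits into a tag, a line index, and a byte offset. The sketch below derives that split from the 4-Kbyte size and 32-byte line length given in the text; the function name is hypothetical, and the assumption that the index is taken from the address bits immediately above the offset is the conventional choice rather than something the text confirms.

```python
LINE_BYTES = 32                          # cache line size from the text
CACHE_BYTES = 4 * 1024                   # total cache size from the text
NUM_LINES = CACHE_BYTES // LINE_BYTES    # 128 lines


def split_address(addr):
    """Split a byte address into (tag, line index, byte offset) for a direct-mapped cache."""
    offset = addr % LINE_BYTES
    index = (addr // LINE_BYTES) % NUM_LINES
    tag = addr // CACHE_BYTES
    return tag, index, offset


if __name__ == "__main__":
    print(split_address(0x12345))
    # Two addresses 4 Kbytes apart map to the same line and evict each other.
    print(split_address(0x0040), split_address(0x1040))
```

With 32-bit addresses this gives a 5-bit offset, a 7-bit index, and a 20-bit tag.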

The basic mechanism used in the DIVA system to support parcel sending/receiving from/to an application is a parcel buffer (or pbuf). The
pbuf has a virtual as well as a physical abstraction. To the application, the pbuf locations appear as regular memory locations that are
manipulated through simple loads and stores. At a physical level, the pbuf is a set of memory-mapped registers. Each PIM node contains a pbuf
that serves as a port between the on-chip parcel interconnect and the node (refer to Figure 2). Although the parcel buffer could be
implemented as registers within the PIM node processor, a memory-mapped mechanism for the parcel buffer allows a uniform implementation for
the node's pbuf as well as a host pbuf. Hence, a pbuf within the PIM chip host interface is memory-mapped into the host processor's address
space to allow the host processor to communicate with PIM nodes via the parcel mechanism.


DIVA partitions the virtual address space of the host processor into three classifications: dumb, which represents standard pages visible only
to the host; global, which is shared by host and PIM; and local, which the PIM node uses for internal computation and is visible to the host
only in supervisor mode. The local memory is further partitioned into segments; global memory for a particular PIM is also represented by
one or more segments. To condense translation information, we use segments, each of which is defined by segment registers containing a
physical base address and limit. The local memory region is partitioned into eight segments at fixed virtual bases, for kernel code, stack and
data, user code, data and stack, and for kernel and user communication buffers. A small number of global segment registers are also used;
since global segments must be able to map portions of a shared virtual address space much larger than the physical memory of an individual
node, global segments must be represented by both a virtual and physical base address register.
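
Base-and-limit translation of this kind can be sketched as follows. The class, the field names, and the reading of "limit" as a segment length in bytes are all assumptions made for the example; the text does not give the register formats.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Segment:
    virt_base: int   # fixed for local segments, held in a register for global ones
    phys_base: int   # physical base address from the segment register
    limit: int       # assumed here to be the segment length in bytes


def translate(segments: Sequence[Segment], vaddr: int) -> Optional[int]:
    """Return the physical address for vaddr, or None if no segment maps it."""
    for seg in segments:
        offset = vaddr - seg.virt_base
        if 0 <= offset < seg.limit:
            return seg.phys_base + offset
    return None


if __name__ == "__main__":
    table = [Segment(virt_base=0x1000, phys_base=0x8000, limit=0x100)]
    print(translate(table, 0x1010))
    print(translate(table, 0x1100))
```

A miss (None in this sketch) would correspond to an address that the node cannot translate locally, for example remote data whose translation is held by its home node.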

Remote addresses are translated via the concept of a home node, which is guaranteed to have the translation. Therefore, a node must
maintain translation information for only eight local segments plus a small number of segments for its portion of the global memory, as well as for
any remote data for which it is the home node. The major advantages of this approach are that translation may be accomplished rapidly, and
translation information on each PIM scales well.
Exceptions


Exceptions, arising from execution of node instructions, and interrupts, from other sources such as an internal timer or external interrupt
signal, are handled by a common mechanism. The exception handling scheme for DIVA has a modest hardware requirement, exporting much
of the complexity to software, to maintain a flexible implementation platform. It provides an integrated mechanism for handling hardware
and software exception sources. Additionally, it provides a flexible priority assignment scheme which minimizes the amount of time that
exception recognition is disabled. While the hardware design supports traditional stack-based exception handlers, we also outline a
non-recursive dispatching scheme which uses DIVA hardware features to allow preemption of lower-priority exception handlers using a
mechanism which should be easier to debug.
Chapter 2 - Registers and Data Types


Introduction 


Description of Node
Registers


User-Level General-
Purpose Registers


This chapter describes DIVA's different registers and their usages, and how data is represented in these registers. The scalar and wide
datapaths each have their own register file. Whether an instruction uses the scalar or wide datapath, arithmetic operations follow a 3-register
format, with two sources and one destination. Transfers between register files are accomplished with explicit move instructions. Data is
transferred between memory and registers with explicit load and store instructions only. Memory operations involving scalar and wide registers
refer to memory locations aligned at 32-bit and 256-bit boundaries, respectively.

The general-purpose registers can be accessed in either user mode or supervisor mode. Some special-purpose registers can be accessed in
user mode, but all remaining special-purpose registers may be accessed only in supervisor mode. For the most part, the registers in the scalar
datapath follow standard RISC systems. The wide datapath, in contrast, has several novel types of registers to facilitate selective execution
on specific subfields of the register. The condition codes have been extended on the wide datapath to maintain a result for each separate data
field, and branch instructions have been added to the ISA to simultaneously check the conditions on all data fields. Another novel feature of
the wide datapath is the ability to select an individual subfield of the wide register, using either an immediate or a scalar general-purpose
register, and move the selected field in an explicit move instruction.

Bey  ond  the  standard  supers  isor-level  registers  required  for  interrupts,  exceptions  and  protection,  a  few  special-purpose  registers  m  the  sys¬ 
tem  support  DIVA-specific  activities.  Segment  registers  are  used  to  support  address  translation  Also,  an  environment  identifier  (TID) 
identifies  the  currently  active  u.ser  program,  for  protection  purposes,  as  well  as  to  support  mter-node  communication. 

Some  additional  registers  on  a  DIVA  PIM  chip  that  are  not  part  of  a  single  node  are  included  in  the  PiRC  and  host  interface,  but  a  di.scussion 
of  these  is  beyond  the  scope  of  this  document. 

The  registers  for  a  DIVA  node  are  summari/ed  in  Table  land  graphically  displayed  in  Figure  5.  This  section  describes  each  tyiv  of  register 
in  detail  In  the  classification  below,  we  first  describe  the  general-purprise  registers,  Ixith  scalar  and  wide,  then  the  special-purpose  registers, 
distinguishing  between  supers  isor-level  registers  and  user-level  registers.  Access  privileges  are  described  by  the  mixle  field  of  the  program 
status  word  ( PS W)  register.  This  organization  is  also  retlected  in  Table  land  Figure  5.  In  Table  I,  the  "ty  [se"  field  describes  the  classification 
of  each  register.  Type  scalar  and  U'lJcH'oni  refer  to  the  general-purpose  registers.  SI’  indicates  the  user-level  special-purpose  registers,  AT 
refers  to  the  address  translation  registers,  and  /’  refers  to  all  other  privileged  registers. 

This  section  describes  the  general-purpose  scalar  and  wide  registers  that  are  accessible  to  user  code. 

General-Purpose Scalar Registers

There are 32 general-purpose scalar registers, each 32 bits wide, which we designate as R0-R31 in Figure 5. This register file is used as the source or destination for all integer scalar instructions. In addition, scalar registers are used to provide addresses for memory accesses to scalar and wide load/store instructions. Further, scalar general-purpose registers can be used to index subfields in a wide register during transfers between register files using the MVSWI and MVWSI instructions (see below). Memory operations to load and store objects to/from a general-purpose scalar register are aligned at 32-bit boundaries. For convenience in performing arithmetic operations where the immediate 0 is one of the operands, R0 is hardwired to hold the value 0.

General-Purpose Wide Registers

There are 32 general-purpose wide registers, each 256 bits wide, which we designate as WR0-WR31 in Figure 5. This register file is used as the source or destination of all wide instructions. Wide instructions perform the same operation on 8-, 16-, or 32-bit subfields of the wide register, as designated by the width (WW) field of the instruction. (Future implementations may also support 64-bit subfields for wide double-precision floating point capability.) The mask register and participation mode register (described below) can optionally be used to designate which subfields will participate in an instruction, if the participation (PP) field of the instruction is set.

Wide registers are loaded from/stored to memory using addresses from the general-purpose scalar registers. Memory operations to load/store objects to/from a general-purpose wide register are aligned at 256-bit boundaries. Individual fields of wide word registers can also be set or read using MVSW, MVWS, MVSWI and MVWSI instructions that use a register or immediate index to specify the data field to be accessed. In addition to arithmetic and transfer operations, wide registers can be updated through the permutation instructions WPRM and WPRMI, which reorganize the data fields of the source register into a destination register. The former instruction uses a third wide register to specify how the data fields will be rearranged, and the latter performs a lookup into a table of hardcoded permutation patterns.


User-Level Special-Purpose Registers


A large number of special-purpose registers are directly or indirectly accessible to the user program, each described in this section.

•  A single condition register for scalar condition codes, and a set of five condition registers for wide condition codes

•  Scratch registers for scalar integer multiply and divide

•  A participation mode register and mask register to support selective execution on the wide ALU

In addition to being read/written indirectly by other ALU operations, the DIVA node architecture permits user-level access to any special-purpose register through explicit moves to standard registers, using the MTSPR and MFSPR instructions.

Scalar Condition Register

The scalar condition code register, CC in Figure 5, consists of 5 bits. The first three bits of CC are set by an algebraic comparison of the result to zero; the other two bits have slightly more peculiar semantics. The condition codes have the CC bit labels and semantics as indicated in the table below. Note that LT, GT, EQ, and CA condition codes are updated only if the current instruction has its condition code enable bit set. The OV condition code is updated for any scalar add or subtract operation, regardless of the condition code enable bit setting, and is sticky; that is, it is only cleared when the condition code register is read. The condition codes are accessed in conditional branch and call statements. Further, like any user-level special-purpose registers, they can be explicitly read and written with the MFSPR and MTSPR instructions, respectively.


Condition Code   CC Bit   Description

LT               0        This bit is set when the result is negative.

GT               1        This bit is set when the result is positive and non-zero.

EQ               2        This bit is set when the result is zero.

OV                        This bit is set to indicate overflow has occurred during execution of an add
                          or subtract instruction. This bit is not altered by any other instructions. In
                          practice, the OV bit is set if the carry out of bit 0 is not equal to the carry out
                          of bit 1 (assuming big Endian bit labeling).

CA                        In general, the carry bit (CA) is set to indicate that a carry out of bit 0
                          occurred during execution of an add or subtract instruction. This bit is not
                          altered by any other instructions.

Figure 4: Scalar Condition Code Register
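
The flag semantics in the table above can be sketched in a few lines of Python. This is an illustrative model only, not the hardware definition: the function name `add_with_cc` and the `prev_ov` parameter are hypothetical, the add is assumed to have its condition code enable bit set, and OV is derived exactly as the table states (carry out of bit 0 compared with carry out of bit 1, big-endian labeling).

```python
MASK32 = 0xFFFFFFFF

def add_with_cc(a, b, prev_ov=0):
    """Add two 32-bit values and return (result, flags).

    Hypothetical model of a scalar add whose condition code enable bit is set.
    """
    a &= MASK32
    b &= MASK32
    full = a + b
    result = full & MASK32
    # Carry out of bit 0 (big-endian label for the most significant bit).
    carry_out_msb = full >> 32
    # Carry out of bit 1, i.e. the carry produced by the low 31 bits.
    carry_out_next = ((a & 0x7FFFFFFF) + (b & 0x7FFFFFFF)) >> 31
    # Interpret the result as a signed value for the comparison with zero.
    signed = result - (1 << 32) if result & 0x80000000 else result
    flags = {
        "LT": int(signed < 0),
        "GT": int(signed > 0),
        "EQ": int(signed == 0),
        # Sticky: once set, OV stays set until the register is read.
        "OV": prev_ov | (carry_out_msb ^ carry_out_next),
        "CA": carry_out_msb,
    }
    return result, flags
```

Passing the previous OV value back in through `prev_ov` is one simple way to model stickiness across a sequence of adds; clearing on read is left to the caller.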


Wide Condition Registers

While the scalar codes are consolidated into a single condition register, the CC described above, each type of WideWord condition code is allocated an entire register so the results of parallel operations on objects as small as bytes may be recorded. Each one of these condition registers is 32 bits wide. Thus, wide condition registers are designated as LT, GT, EQ, OV, and CA. For an example of how the wide condition registers are used, a bit of the WideWord LT register is set if the result of its corresponding 8-bit datapath is negative. However, there are subtleties due to the configurability of the operand sizes. For example, if a WideWord instruction specifies that operands are to be treated as 32-bit values, the condition codes are grouped into eight groups of 4, where each bit of a group is updated with the same value to reflect a condition for the group's corresponding 32-bit result. Like the scalar CC register, the LT, GT, EQ, and CA wide condition registers are only set by instructions that have their C field enabled. The OV register is a sticky register that is updated on all WideWord add and subtract operations; bits of this register are cleared only when the register is read using an mfspr instruction.

The wide condition codes are accessed by the branch instructions BAx and BNx, which represent Branch-On-All and Branch-On-None conditions for the appropriate wide condition register represented by x.
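
As a rough illustration of the grouping described above, the following Python sketch builds the 32 bits of a wide LT register from a list of field results and then evaluates the all/none tests that BAx and BNx are described as performing. The helper names are hypothetical, the bits are kept as a list with the leftmost field first (the actual bit-to-field ordering is an assumption), and participation is ignored.

```python
def wide_lt_bits(fields, width_bits):
    """Return the 32 LT bits for a list of signed field results.

    fields: field results, leftmost field first (assumed ordering).
    width_bits: 8, 16, or 32; wider fields own several identical bits.
    """
    bits_per_field = width_bits // 8
    if len(fields) * bits_per_field != 32:
        raise ValueError("fields must fill a 256-bit register")
    bits = []
    for value in fields:
        # Every bit in a field's group carries the same condition value.
        bits.extend([int(value < 0)] * bits_per_field)
    return bits

def branch_on_all(bits):
    """True when the condition holds for every field (BAx-style test)."""
    return all(bits)

def branch_on_none(bits):
    """True when the condition holds for no field (BNx-style test)."""
    return not any(bits)
```

With 32-bit operands, `wide_lt_bits([-1, 0, 5, -3, 0, 0, 0, 0], 32)` yields eight groups of four identical bits, matching the "eight groups of 4" wording.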

WideWord Floating-Point Status Register

Similar to condition codes, the WideWord floating-point status register (FPSR - special-purpose register 15) may be updated to reflect exception conditions for floating-point operations. This register is a 32-bit register arranged in groups of 4 status conditions for each of the eight 32-bit floating-point units in the WideWord datapath. The 4 status conditions are: divide by zero (DZ), invalid (IV), inexact (IX), and unsupported value (UV). DZ, IV, and IX are typical IEEE-754 floating-point exceptions. Refer to the IEEE-754 standard for details. UV indicates that either overflow or underflow occurred at some point during the program. All bits of FPSR are sticky; once set, they remain set until FPSR is read via an mfspr instruction. The bit arrangement for FPSR is shown below.


DZ0 IV0 IX0 UV0 | DZ1 IV1 IX1 UV1 | ... | DZ7 IV7 IX7 UV7
0                                                      31

FPSR Bit Arrangement
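
A small decoder makes the FPSR layout concrete. The sketch below is a hypothetical helper, not part of the DIVA software: it assumes the big-endian bit labeling used elsewhere in this chapter (bit 0 is the most significant bit) and the group order DZ, IV, IX, UV within each of the eight floating-point units.

```python
CONDITIONS = ("DZ", "IV", "IX", "UV")

def decode_fpsr(fpsr):
    """Return {unit: set of condition names} for a 32-bit FPSR value."""
    status = {}
    for unit in range(8):
        flags = set()
        for k, name in enumerate(CONDITIONS):
            bit_label = 4 * unit + k  # big-endian label: 0 is the leftmost bit
            if (fpsr >> (31 - bit_label)) & 1:
                flags.add(name)
        status[unit] = flags
    return status
```

Because all FPSR bits are sticky, a decoded value reflects every condition raised since the register was last read.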


Scratch registers for integer multiplies and divides

Two registers, designated HI and LO in Figure 5, are automatically set as the result of a scalar integer multiply or divide. HI holds the most significant 32 bits of a multiplication result or the remainder of a division. LO has the least significant 32 bits of a multiplication result or the quotient of a division.
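
The HI/LO split can be illustrated with a short Python sketch. The function names are hypothetical and only the unsigned case is modeled; signed behavior and the result of dividing by zero are not specified in this section, so they are left out (Python itself raises ZeroDivisionError for a zero divisor).

```python
MASK32 = 0xFFFFFFFF

def mul_hi_lo(a, b):
    """Unsigned 32x32 multiply: return (HI, LO) halves of the 64-bit product."""
    product = (a & MASK32) * (b & MASK32)
    return (product >> 32) & MASK32, product & MASK32

def div_hi_lo(a, b):
    """Unsigned divide: return (HI, LO) as (remainder, quotient)."""
    a &= MASK32
    b &= MASK32
    return a % b, a // b
```

For example, multiplying 0xFFFFFFFF by 2 leaves 1 in HI and 0xFFFFFFFE in LO.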

Participation  Mode  Register 

The Participation Mode (PM) register is a 5-bit register that describes the conditions for selective execution of a wide instruction that has its PP field set. The conditions correspond to the four condition codes or the mask register M (as will be discussed in Chapter 5). The PM register is read/written using the MFSPR and MTSPR instructions. It is also updated automatically to select M for participation when the mask register M is updated.

Mask  Register 

The mask register is a 32-bit register used in participation, which we refer to as M in Figure 5. If the PP field of a wide instruction is set, and the M bit of the PM register is set, then the instruction is conditionally executed on each data field that has its corresponding bit in the M register set. Like the WideWord condition codes, if the width of each field is larger than 8 bits, multiple bits in the M register will be set corresponding to a single data field (2 for 16-bit widths, 4 for 32-bit widths). Update of the M register automatically causes the M bit of the PM register to be set.
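
The relationship between M bits and data fields can be sketched as follows. This is a hedged illustration: `expand_mask` and `masked_add` are hypothetical names, fields are held in Python lists with the leftmost field first, and the choice to leave a non-participating field's destination value unchanged is an assumption, since the text above only says that the instruction executes on fields whose M bits are set.

```python
def expand_mask(field_flags, width_bits):
    """Expand one flag per field into the 32 per-byte bits of M."""
    per_field = width_bits // 8  # 1, 2, or 4 bits of M per field
    if len(field_flags) * width_bits != 256:
        raise ValueError("flags must cover a 256-bit register")
    bits = []
    for flag in field_flags:
        bits.extend([int(bool(flag))] * per_field)
    return bits

def masked_add(dest, a, b, m_bits, width_bits):
    """Field-wise add that only updates fields whose M bits are all set."""
    per_field = width_bits // 8
    lane_mask = (1 << width_bits) - 1
    out = []
    for i, (d, x, y) in enumerate(zip(dest, a, b)):
        group = m_bits[i * per_field:(i + 1) * per_field]
        # Assumption: a field that does not participate keeps its old value.
        out.append((x + y) & lane_mask if all(group) else d)
    return out
```

With 32-bit fields, enabling only the first and last fields sets bits 0-3 and 28-31 of M, and only those two fields of the destination change.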


Supervisor-Level Address Translation Registers

A total of 28 32-bit registers related to local and global segments are used to perform translation of virtual addresses to physical addresses by the node processor. A detailed description of how these registers are used in the address translation process can be found in Chapter 10. The registers are set by supervisor-level software using MTPR instructions, usually as a result of a context switch or a change in the size or location of current global segments. They are read either by MFPR instructions, or more commonly, directly by address translation hardware.

A set of 16 registers support local segments, referring to addresses local to the PIM node that are inaccessible to host user code or other PIM nodes. There are eight local segments, with two registers representing each segment. The Local Segment Base registers (SB0-SB7) hold the physical base address of each local segment. The Local Segment Limit registers (SL0-SL7) hold the maximum offset from the base, for address bounds checking, as well as some additional bits to support access protection.
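
The base-plus-limit idea behind the local segment registers can be sketched as below. This is only a schematic of bounds checking, under several assumptions: the segment number and offset are taken as already extracted from the virtual address (the actual address format is described in Chapter 10, not here), the limit value is treated as a plain maximum offset with the protection bits ignored, and `SegmentFault` and `translate_local` are hypothetical names.

```python
class SegmentFault(Exception):
    """Raised when an offset exceeds the segment limit (hypothetical)."""

def translate_local(segment, offset, sb, sl):
    """Translate (segment, offset) to a physical address.

    sb, sl: lists of the eight base values (SB0-SB7) and limits (SL0-SL7).
    """
    if not 0 <= segment < 8:
        raise ValueError("there are eight local segments")
    if offset > sl[segment]:
        # The limit register holds the maximum offset from the base.
        raise SegmentFault(f"offset {offset:#x} exceeds limit of segment {segment}")
    return sb[segment] + offset
```

A caller would load `sb` and `sl` with whatever the supervisor last wrote to the corresponding registers.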

A set of 12 registers support global segments, referring to addresses that may be shared between host and PIM. There are four global segments, and each is supported by three separate registers. Global segments must be able to map portions of a shared virtual address space much larger than the physical memory of an individual node. For this reason, global segments have both Global Segment Physical Base registers (GPB0-GPB3), similar to local segments, as well as Global Segment Virtual Base Registers (GVB0-GVB3). Usages of the Global Segment Limit registers (GL0-GL3) are analogous to the SL0-SL7 registers for local segments.


Other Supervisor-Level Registers


A number of other supervisor-level registers are included to support the PIM run-time kernel activities. These can be classified into the following categories:

•  Scratch  registers 

•  The  program  counter 

•  The processor status word

•  The environment identifier

•  Timer registers, including two to hold current system clock and one used as a countdown timer

•  Registers to support interrupts and exceptions, a total of seven


While in some cases these registers are updated as a result of a hardware event or upon execution of some other instruction, all of the registers can be read from/written to general-purpose registers by the supervisor-level instructions MFPR and MTPR. There are two exceptions to this. The Program Counter is set only by hardware, and cannot be accessed directly, even by supervisor-level code; for this reason, it is not given a register class in Table 1. Also, the Exception Source Word (ESW) is set in software only indirectly through the Exception Set Register and the Exception Reset Register, although it can be read by MFPR; MTPR to the ESW is undefined and is treated as a no-op by the hardware.


Scratch registers

Four 32-bit scratch registers, designated SCR0-SCR3 in Figure 5, are used by the kernel for its various activities. The goal of having these additional registers is to avoid the need to save and restore context of general-purpose registers when switching between the kernel and user-level code. The kernel can instead copy the contents of up to four of the general-purpose registers into SCR0-SCR3, then use the general-purpose registers, and subsequently restore the contents of the general-purpose registers, thus avoiding more costly memory accesses.

Program counter

The program counter (PC) maintains the address of the current instruction to be executed. Although user code causes the PC register to be updated, it is updated indirectly through the execution of instructions that change the flow of control in the program (i.e., branches, procedure calls and interrupts and exceptions).

Upon execution of a branch instruction, the PC is updated by hardware to the target of the branch. For a CALL instruction, the current PC is copied into SR31, and then the PC is updated to the starting point of the called function. A subsequent RET instruction will cause R31 to be copied back to PC. On an interrupt or exception, the current PC is automatically copied into the FADR register (see description below), and is restored from FADR upon execution of an RFE instruction.
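
The PC save/restore rules in this paragraph can be modeled with a tiny state object. The class and method names are hypothetical, and the model copies the PC value verbatim; whether hardware saves the address of the CALL itself or of the following instruction is not modeled here.

```python
class NodeControl:
    """Minimal model of how the PC, R31, and FADR interact."""

    def __init__(self, pc=0):
        self.pc = pc
        self.r31 = 0
        self.fadr = 0

    def branch(self, target):
        self.pc = target

    def call(self, target):
        self.r31 = self.pc   # current PC saved in R31 for the return
        self.pc = target

    def ret(self):
        self.pc = self.r31   # R31 copied back to the PC

    def take_exception(self, handler):
        self.fadr = self.pc  # current PC saved automatically in FADR
        self.pc = handler    # handler address is a placeholder

    def rfe(self):
        self.pc = self.fadr  # PC restored from FADR
```

Because calls use R31 while exceptions use FADR, an exception taken inside a called function does not disturb the pending return address.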

Processor  status  word 

The processor status word is shown as PSW in Table 5. A detailed description of the PSW and its operation is given in Chapter 8.

Environment identifier

A 16-bit EID register records the currently active user context, and it is used to support communication between PIM nodes. A parcel arriving at a PBUF must have an EID in the header that matches the current EID register; otherwise, the parcel must be buffered, awaiting a PIM context switch. The EID register is set by the kernel upon PIM context switch.


Timer registers

Two 32-bit registers, RCL and RCH, hold the low-order and high-order bits, respectively, of the real-time clock. The real-time clock provides a high-resolution measure of real time for indicating the time of day and date. The combination of RCL and RCH may be viewed as a loadable 64-bit counter. At reset, the values of RCH and RCL are all 0s, and they begin incrementing when reset is released. The real-time clock is clocked by the CPU clock. Considering a probable CPU frequency range of 200 MHz to 1 GHz for implementations over the life of this architecture, the 64-bit real-time clock will provide ranges of approximately 585 to 2,900 years at a 1 ns to 5 ns resolution, respectively. RCH and RCL values may be initialized to desired values through the use of the MTPR instruction and are read using the MFPR instruction.
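
Two small helpers illustrate how RCH and RCL combine and how the range of the real-time clock follows from the CPU frequency. Both function names are hypothetical; the range is simply 2^64 clock cycles divided by the clock frequency, converted to years of 365.25 days.

```python
def read_rtc(rch, rcl):
    """Combine the high-order (RCH) and low-order (RCL) halves into 64 bits."""
    return ((rch & 0xFFFFFFFF) << 32) | (rcl & 0xFFFFFFFF)

def wrap_time_years(clock_hz):
    """Years until a 64-bit counter clocked at clock_hz wraps around."""
    seconds = (1 << 64) / clock_hz
    return seconds / (365.25 * 24 * 3600)
```

Software reading the two halves with separate MFPR instructions would also need to guard against RCL rolling over between the two reads; that detail is outside this sketch.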

The TIMER register is a 32-bit decrementing counter that provides a mechanism for causing an interrupt after a programmable delay. The frequency of the TIMER decrement is the same as the CPU clock frequency. The TIMER causes an exception (subject to masking) when it reaches 0 and begins immediately to count down the next interval without processor intervention. The interval is set by initially loading the TIMER register with the interval value using an MTPR instruction. Subsequently, the TIMER returns to the interval value the next cycle after counting down to a 0 value.
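
The auto-reload behavior of TIMER can be sketched as a small simulation. The class is hypothetical, exception masking is omitted, and the exact cycle on which the exception is signaled relative to the reload is an assumption drawn from the wording above.

```python
class Timer:
    """Decrementing counter that reloads itself after reaching 0."""

    def __init__(self):
        self.interval = 0
        self.count = 0

    def mtpr(self, value):
        """Model of loading TIMER: sets the interval and starts the count."""
        self.interval = value & 0xFFFFFFFF
        self.count = self.interval

    def tick(self):
        """Advance one CPU clock; return True when the count reaches 0."""
        if self.count == 0:
            # The cycle after reaching 0, the interval value is restored.
            self.count = self.interval
            return False
        self.count -= 1
        return self.count == 0
```

Under these assumptions an interval of 3 produces an exception every four cycles: three decrements plus the reload cycle.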

Registers to support interrupts and exceptions

There are seven 32-bit registers, shown in Figure 5, that are used to support interrupts and exceptions. A detailed description of their usage can be found in Chapter 8.

The Stored PSW register (SSW) holds the value of the PSW immediately prior to the interrupt or exception. The MADR and FADR registers hold the faulting memory address and/or the address of the faulting instruction, in the event of an exception. If the cause of the exception was just a normal timer-initiated interrupt, the FADR register will hold the next instruction to be executed. All three of these registers are set either by hardware in the event of a hardware exception, or by MTPR instructions at the beginning of a software exception. The PC and PSW registers are restored with the values of FADR and SSW, respectively, on execution of an RFE instruction.

The four additional registers to support exceptions are the Exception Enable Mask register (EMR), the Exception Source Word (ESW), the Exception Set register (ESR) and the Exception Reset register (ERR). The EMR register indicates which exceptions are currently enabled, and is set by the supervisor. Fields of the ESW are set to 1 either directly by hardware in the event of a hardware exception, or by software setting corresponding bits in the ESR register for software exceptions. Fields of the ESW are cleared to 0 by software setting corresponding bits in the ERR register. A description of the bit fields and their meaning can be found in Chapter 8.
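
The set/reset protocol for the ESW can be expressed with simple bit operations. The class below is a hypothetical model; in particular, combining ESW with EMR by a bitwise AND to find enabled pending exceptions assumes that the two registers use matching bit positions, which is defined in Chapter 8 rather than here.

```python
MASK32 = 0xFFFFFFFF

class ExceptionRegs:
    """Model of ESW being driven through ESR, ERR, and hardware events."""

    def __init__(self):
        self.esw = 0
        self.emr = 0

    def hardware_event(self, bits):
        self.esw |= bits & MASK32   # hardware sets ESW fields directly

    def write_esr(self, bits):
        self.esw |= bits & MASK32   # software sets ESW fields via ESR

    def write_err(self, bits):
        self.esw &= ~bits & MASK32  # software clears ESW fields via ERR

    def enabled_pending(self):
        # Assumption: EMR and ESW share the same bit layout.
        return self.esw & self.emr
```

A handler would typically write the bit it has serviced to ERR so that the same exception is not seen again.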
User-Level Registers


General-purpose registers: scalar SR0-SR31; WideWord WR0-WR31
Special-purpose registers: CC, HI, LO, LT, GT, EQ, OV, CA, M, PM

Supervisor-Level Registers

Address translation registers:
    Local segment registers: SB0-SB7, SL0-SL7
    Global segment registers: GVB0-GVB3, GPB0-GPB3, GL0-GL3
Scratch registers: SCR0-SCR3
Supervisor-level special-purpose registers: TIMER, PSW, RCL, RCH, ESW, EMR, EID, ESR, ERR, SSW, FADR, MADR

Figure 5: DIVA Node Registers




Operand Conventions


As stated earlier, memory operations are assumed to be aligned at 32-bit boundaries for the scalar datapath, and 256-bit boundaries for the wide datapath. Thus, on memory operations, the appropriate number of least significant bits in the address should be 0 (last 2 for scalar datapath, last 5 for WideWord datapath). Addresses in memory operations that do not conform to these rules will trigger an exception.
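
The alignment rule amounts to a mask test on the low address bits. In the sketch below, `AlignmentError` and `check_alignment` are hypothetical names standing in for the hardware exception, and addresses are assumed to be byte addresses.

```python
class AlignmentError(Exception):
    """Stands in for the exception raised on a misaligned access."""

def check_alignment(address, wide=False):
    """Return the address if aligned, otherwise raise AlignmentError.

    Scalar accesses need the last 2 bits clear (32-bit boundary);
    WideWord accesses need the last 5 bits clear (256-bit boundary).
    """
    low_bits = 5 if wide else 2
    if address & ((1 << low_bits) - 1):
        kind = "WideWord" if wide else "scalar"
        raise AlignmentError(f"{kind} access to {address:#x} is misaligned")
    return address
```

For instance, address 0x104 is a valid scalar address but not a valid WideWord address, since 0x104 is not a multiple of 32 bytes.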

Following the convention of the PowerPC host, bits and bytes are stored in BigEndian order in memory.




Chapter 3 - ISA Summary


Scalar Instruction Formats


As shown in Figure 6, the DIVA scalar instruction uses a three-operand format to specify two 32-bit source registers and a 32-bit target register. For arithmetic/logical instructions using this format, there is also a C bit to indicate whether the current instruction updates condition codes. However, the C bit indicates signed/unsigned arithmetic for multiply/divide instructions, since these instructions never update condition codes by definition. In lieu of a second source register, a 16-bit immediate value may be specified, as shown in Figure 7.

Figure 6: Format R for Scalar Register Operations

Figure 7: Format I for Scalar Immediate Operations


The branch instruction formats are shown in Figure 8. The branch target address may be PC-relative or calculated using a base register ORed with an offset. In both formats, the offset is in units of words, or 4 bytes, since instructions must be on a 4-byte boundary. Furthermore, the L bit specifies linkage, that is, whether a return instruction address should be saved in R31, referred to as a call instruction. Also, the CCC field specifies one of eight branch conditions: always, equal, not equal, less than, less than or equal, greater than, greater than or equal, or overflow. See the branch and call instruction descriptions in the DIVA ISA document for details.
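
The two branch target calculations can be sketched as follows. These helpers are hypothetical and rest on assumptions the text does not settle: the PC-relative offset is treated as a signed word count applied to the branch's own PC, and the width of the offset field is not modeled.

```python
MASK32 = 0xFFFFFFFF

def pc_relative_target(pc, offset_words):
    """Target = PC plus a word offset scaled to bytes (4 bytes per word)."""
    return (pc + 4 * offset_words) & MASK32

def base_offset_target(base, offset_words):
    """Target = base register ORed with the word offset scaled to bytes."""
    return (base | (offset_words << 2)) & MASK32
```

Because the offset is ORed rather than added in the base-register form, the base would normally have zeros in the bit positions that the scaled offset occupies.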
Figure 8: Format B for Branches


WideWord Instruction Formats


As shown in Figure 9, "WideWord Arithmetic/Logical Format," WideWord instructions follow the general form of scalar instructions. Additional control information is included to manage the data fields of the WideWord, and to modify the execution of the instruction. Figure 10 shows the format for transfers within the WideWord register file and across the scalar, FP, and WideWord register files.


Figure 9: Format for WideWord Arithmetic/Logical Operations


6 bits    5 bits    5 bits    5 bits    2 bits    2 bits    6 bits

opcode    rD    rA    T    PP    WAN    function

Figure 10: Format I for Wide-Word and Inter-Register File Transfers

The control fields are defined as follows:

WW (width)

The WW field sets the width of the WideWord operands to eight, sixteen, or thirty-two bits, which primarily affects the shift operations and the configuration of the carry chain for additions and subtractions. For the merge instruction, these bits specify the condition on which the merge is based. The encoding of these bits is listed in the following table:


WW Value    Operand     Assembler Mnemonic

00          8 bits      b

01          16 bits     h

10          32 bits     w

11          Reserved    NA
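
The effect of the width field on the carry chain can be illustrated by a field-wise add over a 256-bit value. This is a sketch under stated assumptions: `wide_add` is a hypothetical helper, the register is held as a single Python integer, and carries are simply dropped at each field boundary.

```python
# WW encodings from the table above mapped to operand widths in bits.
WW_WIDTH = {0b00: 8, 0b01: 16, 0b10: 32}

def wide_add(a, b, ww):
    """Add two 256-bit values field by field for the given WW encoding."""
    if ww not in WW_WIDTH:
        raise ValueError("reserved WW encoding")
    width = WW_WIDTH[ww]
    lane_mask = (1 << width) - 1
    result = 0
    for shift in range(0, 256, width):
        lane = ((a >> shift) & lane_mask) + ((b >> shift) & lane_mask)
        # The carry chain is cut here: overflow never reaches the next field.
        result |= (lane & lane_mask) << shift
    return result
```

Adding 0xFF and 0x01 gives 0 with 8-bit fields, because the carry is discarded, but 0x100 with 16-bit fields, where the carry stays inside the field.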

C  (condition  code  enable) 

The C bit indicates whether condition codes will be updated as a result of the current instruction's execution. However, the C bit indicates signed/unsigned arithmetic for multiply, pack, and unpack instructions.

PP  (participation) 

The PP field interacts with condition codes to control whether a computation is performed on a given data field. The participation field can specify that a data field participate always, only if a condition local to its own data field is true, only if the data field is the leftmost field with a condition that is true, or only if the data field is the rightmost field with a condition that is true. The condition that is inspected for participation depends on the value of the PM (participation mode) register. Refer to Chapter 5 for more details. The encoding of the PP bits is listed in the following table:


PP Value    Participation Definition        Assembler Mnemonic

00          Always participate              a

01          Specified by local condition    0

10          Leftmost participation          l

11          Rightmost participation         r
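
The four participation modes can be summarized with a short selection function. The function name is hypothetical, and the per-field condition list is assumed to have been chosen already according to the PM register (a condition code or the mask register), with the leftmost field first.

```python
def participating_fields(pp, cond):
    """Return one boolean per field telling whether it participates.

    pp: 2-bit PP value from the instruction.
    cond: per-field condition values, leftmost field first.
    """
    n = len(cond)
    if pp == 0b00:                       # always participate
        return [True] * n
    if pp == 0b01:                       # specified by local condition
        return [bool(c) for c in cond]
    true_idx = [i for i, c in enumerate(cond) if c]
    if not true_idx:                     # no field satisfies the condition
        return [False] * n
    # 10 selects the leftmost true field, 11 the rightmost.
    pick = true_idx[0] if pp == 0b10 else true_idx[-1]
    return [i == pick for i in range(n)]
```

Leftmost and rightmost participation each select at most one field, which makes them suitable for searching a WideWord for the first or last field that meets a condition.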

T (type)

The T bit governs whether the current instruction operates on a vector or scalar. Depending on the function, rD or rA may specify a WideWord register. In this case, the T bit specifies whether the current transfer instruction refers to the WideWord register as a whole vector or instead uses an index to select a sub-field of the WideWord register.

^A/n 

Value to be used as an index when a sub-field of a WideWord is involved in a transfer. Depending on the function, this index field may be an immediate or a scalar GPR specifier. Also, this field may be coupled with either rD or rA depending on the direction of the transfer as specified by the function.


Concise List


A concise list of the instructions in the DIVA Instruction Set Architecture (ISA) is given in Table 2.

TABLE 2. DIVA Instruction Set

FUNC           DESCRIPTION

[illegible]    System Call
[illegible]    Instruction Cache Line Invalidate
RFE            Return from Exception

MTSPR          Move to special-purpose reg
MFSPR          Move from special-purpose reg
MTPR           Move to protected reg
MFPR           Move from protected reg
MTATR          Move to address translation reg
MFATR          Move from address translation reg

Bc             Branch on scalar condition
BAx            Branch on all WideWord conditions
BNx            Branch on no WideWord condition
CALLc          Call on scalar condition
CALLAx         Call on all WideWord conditions
CALLNx         Call on no WideWord condition

Scalar Instructions

ADD            Add
ADDE           Add extended
ADDI           Add immediate
ADDK           Add immediate w/ condition codes
SUB            Subtract
SUBE           Subtract extended
SUBU           Subtract unsigned
MUL            Multiply
MULU           Multiply unsigned
DIV            Divide
DIVU           Divide unsigned
AND            And
ANDI           And immediate
ANDK           And immediate w/ condition codes
NOT            Bitwise inversion
OR             Or
ORI            Or immediate
ORK            Or immediate w/ condition codes
ORIS           Or immediate shifted
XOR            Xor
XORI           Xor immediate
XORK           Xor immediate w/ condition codes
SLL            Shift left logical
SLLI           Shift left logical immediate
SRA            Shift right arithmetic
SRAI           Shift right arithmetic immediate
SRL            Shift right logical
SRLI           Shift right logical immediate
LD             Load Reg from Mem
ST             Store Reg to Mem

WideWord Instructions

WADD           Add
WADDE          Add extended
WSUB           Subtract
WSUBE          Subtract extended
WSUBU          Subtract unsigned
WMULES         Multiply even signed
WMULEU         Multiply even unsigned
WMULOS         Multiply odd signed
WMULOU         Multiply odd unsigned
WAND           And
WNOT           Bitwise inversion
WOR            Or
WXOR           Xor
WSLL           Shift left logical
WSLLI          Shift left logical immediate
WSRA           Shift right arithmetic
WSRAI          Shift right arithmetic immediate
WSRL           Shift right logical
WSRLI          Shift right logical immediate
WLD            Load Reg from Mem
WST            Store Reg to Mem
WFABS          Floating-point absolute value
WFADD          Floating-point add
WFDIV          Floating-point divide
WFMUL          Floating-point multiply
WFNEG          Floating-point negate
WFSUB          Floating-point subtract
[illegible]    Floating-point to integer conversion
[illegible]    Integer to floating-point conversion

Special WideWord Instructions

WPRM           Permute
WPRMI          Permute immediate
WMRG           Merge based on condition codes
[illegible]    Pack using signed arithmetic
[illegible]    Pack using unsigned arithmetic
[illegible]    Unpack high-order byte/halfword
[illegible]    Unpack low-order byte/halfword

Transfer Instructions

MVSW           Move scalar to WW
MVSWI          Move scalar to WW, indirect
MVWS           Move WW to scalar
MVWSI          Move WW to scalar, indirect
MVWW           Move WW to WW
MVWWI          Move WW to WW, indirect

Miscellaneous Instructions

[illegible]    Lock Load
LOKS           Lock Store
PROBE          Probe address to determine locality
ELO            Encode leftmost one
CLO            Clear leftmost one


Chapter 4 - Execution Pipeline and Scalar Datapath


Introduction 


Pipeline Stages


Major  Signal  Paths 


The DIVA execution pipeline is modeled as a five-stage architecture, and is used to control the operation of the scalar and WideWord datapaths. Because the combined pipeline and scalar datapath are quite similar to familiar RISC processor architectures, the operation of these units is detailed together to simplify description. A later section will describe the operations of the WideWord datapath. The stages of the pipeline are named here, with an explanation of the major events occurring within that stage of execution.

We  establish  the  convention  that  each  stage  views  it's  local  instruction  and  output  to  be  synchroni/ed  at  the  next  clock  edge  as  the  current 
instruction.  While  from  an  e.xternal  v  ievv,  there  are  /iw  instructions  ‘■currently"  executing,  the  ALU  stage  sees  an  o|Kode,  two  operands,  and 
control  and  stored  state  as  components  of  the  "currenf  instruction.  I  his  view  of  execution  local  to  each  stage  is  the  convention  used  in  all 
descriptions  of  the  pipeline. 

F - instruction fetch

The F stage of the pipeline is where the address of the current instruction is applied to the instruction cache and the instruction
is located. At the end of the cycle the output of the instruction cache is latched into the first register stage of the pipeline.

During the F stage, the address for the next instruction is calculated. Note that the calculation applies to sequential addresses
as well as branches.

D - register decode

During the D stage, operands for the current instruction are selected from the register file or the most recent value in the
pipeline forwarding logic. In the case of an immediate instruction, the immediate field of the current instruction is routed to the
SRC2 pipeline. The result is latched into the datapath D-stage registers.

X - execute

Depending on the instruction, the X stage selects either the operands from the local register file, or an operand from the
WideWord register file, and forwards the result to the ALU, which performs the computation defined by the opcode and
value fields of the current instruction.

M - memory

Register load and store instructions require memory accesses. To maintain consistency with the normal register-write logic,
memory operations are begun during the M cycle, and the pipeline is stalled until memory arbitration and the required read
operation have been performed. During memory write operations, the pipeline is released as soon as arbitration grants access to
the memory.

W - write

During the W stage, the register file is written with the result of the current operation, whether a computation or a memory
read. During the W stage, memory write operations are allowed to complete.

Major data and control paths of the DIVA node processors are shown in Figure 11 and Figure 12. Execution pipeline logic is depicted in the
shaded area of the figures, while the unshaded area of the figures shows the control pipeline and scalar datapath.

Figure 12: DIVA 5-Stage Execution Pipeline (D through W Stages)

Scalar Computing
Functions


The scalar datapath performs operations on objects of 32 bits or less. Refer to the DIVA Instruction Set Architecture document for a
complete description of these operations.


DIVA Pipeline
Analysis

Address Calculations


Numerous examples of a five-stage pipeline exist in the literature, providing a starting design-point for new machines, including DIVA. We
perform an analysis of the DIVA pipe to ensure no undue overhead is incurred by branches or other changes in program flow.

Figure 13 below is excerpted from the earlier execution pipeline illustration, Figure 11. The address calculation portion of the pipeline has
been highlighted to clarify the several parallel paths used to develop the address of the next instruction to be executed. Address computations
are performed in parallel to guarantee the fastest possible operation. The address calculations indicated in the figure are: pc increment,
pc offset, and register offset, which correspond to the types of branches supported by DIVA.


Figure 13: DIVA Instruction-Address Pipeline


Branch Pipeline States


Pipeline Hazards


Instruction Sequences


Register Operations


Memory Operations


As shown in Figure 13, a branch instruction incurs a two-clock delay, or stall, before the first post-branch instruction can be accessed from
the instruction cache and loaded into the execution pipeline. Because the branch instruction doesn't depend on the two ADD instructions, it
should be possible for the compiler to move the branch instruction to the point before the first ADD in order to avoid totally wasted dead
spots (or "bubbles") in the flow of program execution. In the event a code sequence cannot be rescheduled, either logic is required to keep
the pipeline executing correctly, or NOP instructions must be inserted into the program to ensure proper operation. Obviously, it is simplest
to insert the NOP instructions, as they require minimal pipeline control logic.

In pipelined systems, hazards occur when an operation is begun before another has completed, or before required results are available. In
DIVA, these are broken down into three classes: instruction sequences, register operations, and memory operations. Each of these hazard
classes is described below.

There are several instances of instructions that incur hazards due to "extra" time required for completion. Among these instructions are
integer multiply and divide. When these instructions reach the execute (X) stage of the pipeline, the pipeline is stalled for the required
number of clock cycles.

Register hazards occur when an instruction requires an operand that is currently in the data pipeline. In the simplest case, consider a stream
of instructions where a register is required in the same clock cycle where it is being written into the register file. This hazard can be very
simply eliminated by requiring register writes to complete in the first half of each clock cycle, and performing all register reads during the
second half. This is well within the capabilities of the technology.

Consider the following code sequence, where an operand is not ready:

ADD  R3, R1, R2  /* R3 = R1 + R2 */

ADD  R5, R3, R4  /* R5 = R3 + R4 */

Because R3 is emerging from the ALU as the first instruction finishes execution, it is not available to be fetched from the register file. This
hazard requires bypassing or forwarding to get the most recent copy of a register from a later stage in the pipeline, and move it to the ALU
inputs. Selection is performed by comparing the destination address of every register in the pipeline against the register specifications
accessing the register file. The most recent copy (closest to the ALU) is selected, resolving events where several copies of a register are in the
pipeline.
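
The selection rule for forwarding can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the DIVA forwarding logic itself: the names InFlight and read_operand, and the convention that the list of in-flight results is ordered from the stage closest to the ALU to the stage furthest from it, are hypothetical.

```python
from typing import List, NamedTuple, Optional


class InFlight(NamedTuple):
    """A result travelling down the pipeline (hypothetical model)."""
    dest: Optional[int]  # destination register number, or None if nothing is written
    value: int


def read_operand(reg: int, regfile: List[int], later_stages: List[InFlight]) -> int:
    """Return the newest copy of `reg`.

    `later_stages` is assumed to be ordered from the stage closest to the ALU
    (most recent result) to the stage furthest from it, so the first match wins.
    """
    for entry in later_stages:
        if entry.dest == reg:
            return entry.value
    return regfile[reg]  # no copy in flight: the register file is up to date


regfile = [0] * 32
regfile[1], regfile[2], regfile[4] = 10, 20, 5

# ADD R3, R1, R2 has just left the ALU; R3 is not yet in the register file.
in_flight = [InFlight(dest=3, value=regfile[1] + regfile[2])]

# ADD R5, R3, R4 picks up R3 from the pipeline and R4 from the register file.
r5 = read_operand(3, regfile, in_flight) + read_operand(4, regfile, in_flight)
print(r5)
```

When several copies of the same register are in flight, the first match in the list models the "closest to the ALU" choice described above.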

Memory-related hazards can occur in DIVA. These are caused by the proximity of register load and store instructions. Consider the
following code sequence, which is typical of moving data for further processing:


MOV  R1, R0          /* initialize the index */
LD   R2, TABL1, R1
ST   R2, TABL2, R1
ADD  R1, 0x1

Now it is impossible for both the execution pipeline and the memory to respond to the load and store instructions as written. First, the
pipeline can't store a value that has not yet been loaded: the register write-back stage is after the memory write stage. Second, there is no
guarantee that the objects TABL1 and TABL2 are located in the same open row in memory. As a result, an unknown number of delays will
occur before the store request will start in the memory.

Chapter 5 - WideWord Datapath


Participation


Participation Field


Participation Mode

The WideWord ALU supports selective execution of instructions on sub-fields within a WideWord. Under selective execution, only the
results corresponding to the data paths that participate in the computation are written back, or committed, to the instruction's destination
registers. The data fields that participate in the conditional execution of a given instruction are derived from the condition codes or the mask
register, plus the instruction's participation field. The conditions used (condition codes or mask register) are specified in the participation
mode register. The instruction's participation field determines how the condition code (or mask register) bits are combined to specify the
participation of each data path.

Each WideWord instruction with support for conditional execution has a 2-bit participation field. The participation field specifies four ways
in which the condition code (or mask register) bits are combined for determining participation of each data path: (1) Always participate,
where all data fields participate; (2) Local participation, where a data field participates only if a condition local to its own data path is true;
(3) Leftmost participation, where only the leftmost data field with a condition that is true participates; and (4) Rightmost participation,
where only the rightmost data field with a condition that is true participates. The encoding of the participation field (PP) bits is described in
the document "DIVA ISA Overview", and is also listed in the following table:


PP Value   Participation Definition
00         Always participate
01         Local participation
10         Leftmost participation
11         Rightmost participation

The conditions that are inspected for participation depend on the value of the Participation Mode (PM) register. The PM register is a 5-bit
register that is read/written using the mfspr/mtspr instructions. The conditions correspond to the condition codes EQ, GT, LT, OV or the
mask register M. The encoding of the Participation Mode is shown in the following table:


PM Value   Mask/Condition Code
00001      M
00010      EQ
00100      GT
01000      LT
10000      OV

Any combination of the 5 conditions listed in the table can be used to determine participation. For instance, if the PM value is 00110, the EQ
and GT condition codes are ORed together to determine participation.

In addition, if the mask register is updated, the participation mode register is automatically updated to select M for participation.

The figure below illustrates an implementation of local participation for data path i (note that this simple example is not a complete
implementation of a participation bit and does not include the participation field bits):


Figure 14: Example of participation bit derived from PM register and condition codes
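
How the PM register and the PP field combine can be illustrated with a small Python model. It is a sketch under stated assumptions rather than the hardware logic: the function name participation, the representation of each condition as a list of 0/1 values per data path with the leftmost path first, and the behaviour when no condition is true under leftmost or rightmost participation (no path participates) are all choices made for this example.

```python
from typing import Dict, List

# PM register bit assignments, taken from the Participation Mode table.
PM_BITS = {"M": 0b00001, "EQ": 0b00010, "GT": 0b00100, "LT": 0b01000, "OV": 0b10000}


def participation(pp: int, pm: int, conds: Dict[str, List[int]]) -> List[bool]:
    """Return one participation flag per data path (leftmost path first).

    `conds` must hold one list of 0/1 values per condition name in PM_BITS.
    """
    n = len(conds["M"])
    # OR together every condition that the PM register selects.
    selected = [
        any(conds[name][i] for name, bit in PM_BITS.items() if pm & bit)
        for i in range(n)
    ]
    if pp == 0b00:                      # always participate
        return [True] * n
    if pp == 0b01:                      # local participation
        return selected
    hits = [i for i, flag in enumerate(selected) if flag]
    if not hits:                        # assumption: nothing participates
        return [False] * n
    keep = hits[0] if pp == 0b10 else hits[-1]   # leftmost / rightmost
    return [i == keep for i in range(n)]


conds = {
    "M":  [0, 0, 0, 0],
    "EQ": [0, 1, 0, 0],
    "GT": [0, 0, 1, 0],
    "LT": [1, 0, 0, 1],
    "OV": [0, 0, 0, 0],
}
local = participation(0b01, 0b00110, conds)      # EQ OR GT, per data path
leftmost = participation(0b10, 0b00110, conds)
rightmost = participation(0b11, 0b00110, conds)
print(local, leftmost, rightmost)
```

The example uses only four data paths to keep the lists short; the same function works for any number of paths.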


Setting the condition bits
for participation

For simplicity, the WideWord ALU performs conditional write-backs (commits the results) on 8-bit datapaths, independently of the datapath
width of the instruction. Conditional operations on 16-bit or 32-bit data paths assume that the condition bits for participation (condition
codes or mask register) are set consistently with the current datapath width. For example, an instruction that operates on 32-bit data fields
should have a 32-bit result written back to the destination register, for each participating 32-bit data field. Therefore, since the WideWord
ALU performs conditional write-backs of 8-bit values, the 4 consecutive bits of the condition code/mask register corresponding to a 32-bit
datapath should be set consistently (either all ones, for participation, or all zeros). It is the programmer's responsibility to ensure that the
conditions for participation are consistent with the datapath width, either by setting the mask register or by performing a previous operation
with the same datapath width to set the condition codes.
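
The consistency requirement can be made concrete with a short Python helper that expands one flag per data field into the 32 per-byte condition bits of a 256-bit WideWord. The function name expand_condition_bits and the list representation are hypothetical and used only for illustration.

```python
from typing import List


def expand_condition_bits(field_flags: List[int], field_bits: int) -> List[int]:
    """Expand one flag per data field into one bit per 8-bit datapath.

    A 256-bit WideWord has 32 byte-wide datapaths, so a 32-bit field owns
    4 consecutive bits and a 16-bit field owns 2.
    """
    if field_bits not in (8, 16, 32):
        raise ValueError("field width must be 8, 16, or 32 bits")
    bytes_per_field = field_bits // 8
    if len(field_flags) * bytes_per_field != 32:
        raise ValueError("flags do not cover a 256-bit WideWord")
    bits: List[int] = []
    for flag in field_flags:
        bits.extend([1 if flag else 0] * bytes_per_field)
    return bits


# Eight 32-bit fields; only the first and the last participate.
mask_bits = expand_condition_bits([1, 0, 0, 0, 0, 0, 0, 1], 32)
print(mask_bits)
```

Each participating 32-bit field contributes four identical bits, which is the "all ones or all zeros" condition stated above.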

Permutation

The WideWord permutation network supports fast alignment and reorganization of data in wide registers. The permutation network supports
general permutations of 8-bit data fields, that is, any 8-bit data field of the source register can be moved into any 8-bit data field of the
destination register. A permutation is specified by a permutation vector, which is a 256-bit object containing 32 indices corresponding to the
32 8-bit data fields of a WideWord. Each 8-bit field of a permutation vector corresponds to the same 8-bit data field of the destination
register, and contains the index of the source data field to be moved into that destination field. The figure below illustrates a permutation on
8-bit and 16-bit data paths, and the corresponding permutation vectors.

Example (a): shuffle sequences of 8 fields, for 8-bit data fields

perm vector  31,27,30,26,29,25,28,24,23,19,22,18,21,17,20,16,15,11,14,10,13,09,12,08,07,03,06,02,05,01,04,00

Figure 15: Example of permutation vectors for 8-bit and 16-bit data paths
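
The rule that each vector entry names the source field for the matching destination field can be sketched as follows. This is a minimal Python model, assuming a WideWord is represented as a list of 32 byte values indexed by field number; the function name wide_permute is hypothetical.

```python
from typing import List


def wide_permute(src: List[int], perm: List[int]) -> List[int]:
    """Apply a permutation vector to a 32-field WideWord.

    Destination field i receives source field perm[i], as described in the
    text: each vector entry holds the index of the source field to move.
    """
    if len(src) != 32 or len(perm) != 32:
        raise ValueError("a WideWord has 32 8-bit data fields")
    if any(not 0 <= p < 32 for p in perm):
        raise ValueError("permutation indices must be in the range 0..31")
    return [src[p] for p in perm]


src = list(range(100, 132))            # field i holds the value 100 + i

reverse = list(range(31, -1, -1))      # destination field i takes source field 31 - i
rotate = [(i + 1) % 32 for i in range(32)]  # destination field i takes source field i + 1

reversed_word = wide_permute(src, reverse)
rotated_word = wide_permute(src, rotate)
print(reversed_word[:4], rotated_word[:4])
```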

The WideWord supports two types of permutation operations, wprm and wprmi. In wprm the permutation vector is in a general-purpose
wide register, allowing permutation vectors to be loaded from memory and manipulated using WideWord operations. wprmi selects a
permutation vector from a lookup table, supporting faster permutations (one operation) for the set of frequently used permutation vectors in
the table. The hardwired permutation vectors are listed in the following table, and the permute instructions are described in more detail in
the document "DIVA ISA Overview".

index  vector

0x00   0x000102030405060708090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F
0x01   0x0102030405060708090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F00
0x02   0x02030405060708090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F0001
0x03   0x030405060708090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F000102
0x04   0x0405060708090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F00010203
0x05   0x05060708090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F0001020304
0x06   0x060708090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F000102030405

index  vector

0x28   0x00080109020A030B040C050D060E070F10181119121A131B141C151D161E171F
0x29   0x0001040508090C0D1011141518191C1D020306070A0B0E0F121316171A1B1E1F
0x2A   0x02030001060704050A0B08090E0F0C0D12131011161714151A1B18191E1F1C1D
0x2B   0x06070405020300010E0F0C0D0A0B080916171415121310111E1F1C1D1A1B1819
0x2C   0x0E0F0C0D0A0B080906070405020300011E1F1C1D1A1B18191617141512131011
0x2D   0x1E1F1C1D1A1B181916171415121310110E0F0C0D0A0B08090607040502030001
0x2E   0x000104050203060708090C0D0A0B0E0F101114151213161718191C1D1A1B1E1F
0x2F   0x0001080902030A0B04050C0D06070E0F1011181912131A1B14151C1D16171E1F
0x30   0x0001020308090A0B1011121318191A1B040506070C0D0E0F141516171C1D1E1F
0x31   0x04050607000102030C0D0E0F08090A0B14151617101112131C1D1E1F18191A1B
0x32   0x0C0D0E0F08090A0B04050607000102031C1D1E1F18191A1B1415161710111213
0x33   0x1C1D1E1F18191A1B14151617101112130C0D0E0F08090A0B0405060700010203
0x34   0x0001020308090A0B040506070C0D0E0F1011121318191A1B141516171C1D1E1F

Merge

The WideWord unit supports a special instruction (wmrg) for merging data from two source registers according to a given condition. The
condition is specified by the WW field of the instruction, and can be one of the condition codes EQ, LT or GT, or the M register. The
following table shows the encoding of the WW field.


WW Value   CC
00         EQ
01         LT
10         GT
11         M

The figure below illustrates a merge operation using the condition LT. The condition codes are set by a previous wsubc instruction with the
same data path width as the wmrg instruction.

wsubcw   r4, r1, r2
wmrgltw  r3, r1, r2
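
The effect of this two-instruction sequence can be modelled in Python, with two caveats. The text does not state which source register supplies a field when the condition is true, so the sketch assumes the first source is taken where the condition holds and the second source otherwise. It also assumes LT is set for a field when the first operand of the subtract is less than the second. The helper names are hypothetical.

```python
from typing import List


def wsubc_lt(a: List[int], b: List[int]) -> List[bool]:
    """Per-field LT condition after a - b (assumed: LT is set when a < b)."""
    return [x < y for x, y in zip(a, b)]


def wmrg(cond: List[bool], a: List[int], b: List[int]) -> List[int]:
    """Merge two registers field by field.

    Assumption for this sketch: a field comes from `a` where the condition
    is true and from `b` where it is false.
    """
    return [x if c else y for c, x, y in zip(cond, a, b)]


# Eight 32-bit fields per WideWord register.
r1 = [5, 9, 2, 7, 1, 8, 3, 6]
r2 = [4, 9, 6, 0, 2, 8, 1, 7]

lt = wsubc_lt(r1, r2)      # models wsubcw r4, r1, r2 (only the LT flags are kept)
r3 = wmrg(lt, r1, r2)      # models wmrgltw r3, r1, r2
print(r3)
```

Under these assumptions the pair computes a field-wise minimum of r1 and r2; with the opposite selection convention it would compute the maximum.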


Pack/Unpack
Transfers


A set of transfer instructions allows data to be moved between the several register files: (1) between wide registers and general-purpose
scalar registers; (2) from wide register to wide register; and (3) between general-purpose integer registers and special-purpose or protected
registers. The transfer functions where the source is a scalar value (scalar register or a data field in a wide register), and the destination is a
wide register, allow the source data to be replicated and stored into all the fields of the destination.

The complete set of transfer instructions is listed in the table below, and each instruction is described in detail in the document "DIVA ISA
Overview".

name | mnemonic | syntax | operation

move from protected register | MFPR | rD, prA | The contents of protected register prA are stored in rD.

move from special-purpose register | MFSPR | rD, sprA | The contents of special-purpose register sprA are stored in rD.

move to protected register | MTPR | prD, rA | The contents of rA are stored in protected register prD.

move to special-purpose register | MTSPR | sprD, rA | The contents of rA are stored in special-purpose register sprD.

move from scalar to wide | MVSW | wrD, rA, index | Some portion or all of the contents of rA are transferred to a subfield of wrD, starting at the byte specified by the byte index. (a)

 | MVSWR | wrD, rA | The contents of rA are replicated to form a 256-bit value which is transferred to wrD, subject to participation.

move from scalar to wide indirect | MVSWI | wrD, rA, rB | Some portion or all of the contents of rA are transferred to a subfield of wrD, starting at the byte specified by the low-order bits of the contents of rB. (b)

move from wide to scalar | MVWS | rD, wrA, index | A subfield of the contents of wrA starting at the byte specified by the byte index is transferred to rD.

move from wide to scalar indirect | MVWSI | rD, wrA, rB | A subfield of the contents of wrA starting at the byte specified by the low-order bits of the contents of rB is transferred to rD.

move from wide to wide | MVWW | wrD, wrA, index | The entire 256-bit contents of wrA are transferred to wrD, subject to participation.

 | MVWWR | wrD, wrA, index | The subfield of wrA starting at the byte specified by the byte index is replicated to form a 256-bit value which is transferred to wrD, subject to participation. (a)

move from wide to wide indirect replicating | MVWWRI | wrD, wrA, rB | The subfield of wrA starting at the byte specified by the low-order bits of the contents of rB is replicated to form a 256-bit value which is transferred to wrD, subject to participation. (b)

a. Depending on the size of the data to be transferred, the least significant bits of the index may be ignored to ensure proper alignment.

b. Depending on the size of the data to be transferred, the least significant bits of the contents of rB may be ignored to ensure proper alignment.

c. For data sizes less than 32 bits, the high-order bits of rD are cleared.
Chapter 6 - Memory Unit


Introduction


Memory Macro
Description


Memory Controller
Description


This chapter presents the basic functionality of the assumed DRAM memory macro as well as the essentials of a memory controller needed
with each macro on a DIVA PIM chip. This controller serves to arbitrate among the various requests for access to a DRAM macro and take
advantage of page-mode accesses wherever possible.


A DRAM array similar to the DRAM macro provided by the IBM SA27-E process is assumed. This macro exhibits features of typical
DRAM: page-mode accesses, refresh, etc. Unlike conventional DRAM, however, it supports a full address bus, rather than a row-column
multiplexed one, and a very wide 256-bit data bus. Specifically, the input signals to the macro are: macro select (similar to RAS in
conventional DRAM), page-mode select (similar to CAS in conventional DRAM), write enable, refresh enable, an address bus where 3 bits
of the bus are treated as a column address, a 256-bit input data bus, and a 256-bit write enable bus. The only output signal is a 256-bit output
data bus. There are also some test inputs/outputs, but these are not crucial to the DIVA architecture design. The macro page size is 2048 bits;
each page contains 8 distinct addressable 256-bit units of data. For an example of the timing benefits of page-mode accesses, the page-mode
cycle time in the SA27-E technology is 6.6ns while the random mode cycle time is 20ns.


Figure 16: DIVA Memory Controller


A diagram of the memory controller is given in Figure 16. The basic components of the memory controller are an arbiter, a refresh timer, the
Current Page Address register, and the Memory Interface. The arbiter is responsible for handshaking with all possible requesters of access to
the memory array and determining the priority of competing requests. It communicates closely with the Memory Interface, which is
responsible for generating all control signals to the memory array, such as address bits, macro select and page-mode select strobes, refresh
pulses, write enables, etc. The Current Page Address register contains the address of the page which is currently held in the sense amps of
the memory array.

The operation of the memory controller is best described by the flowchart given in Figure 17. Upon reset, the memory controller is in an idle
state awaiting access requests. If any request(s) occurs, the controller performs an arbitration phase. There are two basic types of requests:
refresh and normal access (read or write). If a refresh cycle is pending, it is performed. Note that no address is needed for refresh cycles.
However, the refresh cycle corrupts the sense amps so the contents of the Current Page Address register are no longer valid.

If the access request that "wins" the arbitration phase is not a refresh request, i.e. it is a normal access, the address presented with the request
is compared against the contents of the Current Page Address register, assuming that a page is currently open. If the portion of the requesting
address which designates the DRAM page matches the value of the Current Page Address register, the access is performed as a page-mode
access, minimizing latency. If the two values are unequal, a random access must be performed, which entails restoring the currently open
page and strobing in the new page corresponding to the access request. Simultaneously with this access, the new page address is latched into
the Current Page Address register.


Figure 17: DIVA Memory Controller Flowchart
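
The page-hit decision in the flowchart can be modelled with a few lines of Python. This is a sketch under stated assumptions, not the controller design: addresses are counted in 256-bit units, the 3 column bits are assumed to be the low-order address bits, the class and method names are hypothetical, and the cost of the refresh cycle itself is not modelled. The cycle times are the SA27-E figures quoted in the memory macro description.

```python
from typing import Optional

PAGE_MODE_NS = 6.6   # page-mode cycle time quoted for SA27-E
RANDOM_NS = 20.0     # random-mode cycle time quoted for SA27-E
COLUMN_BITS = 3      # 8 units of 256 bits per 2048-bit page (assumed low-order bits)


class MemoryControllerModel:
    """Tracks the open page and reports the cycle time of each access."""

    def __init__(self) -> None:
        self.current_page: Optional[int] = None  # None: no valid open page

    def refresh(self) -> None:
        # A refresh corrupts the sense amps, so the open page is forgotten.
        self.current_page = None

    def access(self, unit_address: int) -> float:
        page = unit_address >> COLUMN_BITS
        if self.current_page == page:
            return PAGE_MODE_NS              # page-mode access
        self.current_page = page             # latch the new page address
        return RANDOM_NS                     # random access


mc = MemoryControllerModel()
times = [mc.access(a) for a in range(8)]     # one full page, sequentially
total_ns = sum(times)
print(times, total_ns)
```

In this model, reading one full page sequentially costs one random access followed by seven page-mode accesses, while any access that follows a refresh is a random access.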
Sources of Requests
and Arbitration
Priorities


Requests for normal accesses, i.e. reads and writes, may originate from several sources within the PIM node. The possible sources are the
host interface port, the processor instruction cache, and the memory stage of the processor pipeline. With the possibility of these sources
competing for memory access, arbitration priorities must be formulated. As indicated in the flowchart of Figure 17, refresh cycles have the
highest priority. The full priority order, including the remaining sources, is:

1. Refresh

2. Host interface

3. Processor memory stage

4. Processor instruction cache

After refresh, the host interface has the highest priority since minimal latency penalties for conventional DRAM accesses are highly desired.
Since a memory request from either the memory stage or the instruction cache of the processor will stall the processor pipeline, the priority
between these makes little difference. If there are requests pending from both, they must both be satisfied before the pipeline can advance.
However, the processor memory stage is assigned a higher priority to simplify the pipeline control logic since the memory stage is deeper in
the pipeline than the instruction fetch stage.

To request a memory access, each of these units must provide an address, type of access (read or write), and data (for write operations). In
addition to these signals, there are handshaking signals between the arbiter and these units to indicate when requests are pending and when
they have been granted.
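
The fixed priority order can be expressed as a short Python function. The source names and the function name arbitrate are hypothetical labels chosen for this sketch; only the ordering comes from the text.

```python
from typing import Iterable, Optional

# Highest priority first, following the order given in the text.
PRIORITY = ("refresh", "host_interface", "memory_stage", "instruction_cache")


def arbitrate(pending: Iterable[str]) -> Optional[str]:
    """Return the pending requester with the highest priority, or None if idle."""
    waiting = set(pending)
    for source in PRIORITY:
        if source in waiting:
            return source
    return None


winner = arbitrate(["instruction_cache", "memory_stage"])
print(winner)
```

When both processor sources are pending, the memory stage is served first and the instruction cache request remains pending for the next arbitration phase.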
Chapter 7 - Instruction Cache


Introduction


Instruction Cache
Description


Instruction Cache
Organization


It is of critical importance to keep instruction fetches from interfering with the flow of operand data from the node memories. In addition to
the reduction of operand data bandwidth due simply to contention, instruction fetches from memory reduce bandwidth even further due to
the resulting increase in memory latency because they disrupt reference locality. Since the code segment of an application is placed in a
different area of memory from the data segment, interleaving instruction fetches with operand fetches from memory would cause many
random memory accesses that could have otherwise been satisfied in a page-mode fashion. DIVA avoids most of the bandwidth losses by
implementing a small instruction cache.


The DIVA PIM node processor contains a 4-Kbyte, direct-mapped instruction cache. The cache line size is 32 bytes, and each line can be
loaded or invalidated individually. In addition, the entire cache can be invalidated by disabling the cache. The DIVA architecture does not
support self-modifying code, so the instruction cache does not require any write-back capability. The cache does not contain a snooping port
and is therefore not kept coherent with memory automatically. Kernel software is responsible for invalidating stale cache lines when the
backing memory for those lines is being loaded with new code.


The cache consists of three major components: core RAM, tag RAM, and the controller. A diagram showing the organization of the core
RAM and tag RAM is shown in Figure 18. The core RAM consists of 128 lines, where each line is 256 bits long. Each line is then capable of
storing eight 32-bit instructions. The tag RAM contains a 20-bit tag for each line of core RAM, although the tag size could be reduced to
match the amount of physical memory actually present and thereby optimize the storage and performance of tag accesses. Each tag RAM
line also contains a valid bit to indicate whether the line is empty or actually contains valid information.


Figure 18: Instruction Cache Organization


A physical address is decoded as shown in Figure 19 for determining placement or validity within the cache. The two least significant bits
are ignored as they should always be zero because instructions are 32 bits in size and aligned to 32-bit boundaries. Bits 27 through 29 are
used to select a specific instruction within a cache line, and bits 20 through 26 are used to specify the cache line. The upper 20 bits are then
used as the tag information for a cache line. The instruction cache unit operates closely with the address translation unit. For example, the
least significant 12 bits of instruction virtual addresses are assumed to be unaffected by the address translation process. Therefore, these bits
can be used to index into the cache simultaneously with the translation of the upper 20 bits. By the time the appropriate tag has been
accessed, the translation has taken place, so that the tag contents can be compared with the physical address.


bits 0-19    physical tag
bits 20-26   line number
bits 27-29   instruction
bits 30-31   ignored (always zero)

Figure 19: Instruction Cache Address Interpretation

Instruction Cache Operation

The operation of the instruction cache is best described by defining the tasks of the cache controller. The controller is responsible for managing all activity of the cache, including instruction fetches from the cache, loading cache lines from memory, and invalidating cache lines.

The controller is basically a finite state machine (FSM) with three states, where each state has sub-states. The FSM diagram is shown in Figure 20.


[Figure 20: Cache Controller Finite State Machine. Transition labels: Hit, Inv, Enable.]

At processor boot time, the cache controller is in the disabled state. In this state, when the processor makes an instruction request, a 256-bit data item including the desired instruction is fetched from memory, and the requested instruction is selected from the incoming data and placed onto the instruction bus. All the valid bits are also reset when the controller enters the disabled state.

When code enables the cache, which asserts the enable signal, the controller enters the normal state. In this state two operations are possible: read and invalidate. During a read operation the controller performs an instruction fetch by comparing the tag portion of the supplied address with the tag of the appropriate line of the tag RAM. If they match and the valid bit is set, then the desired word is selected and placed onto the instruction bus, and the hit signal is asserted. Otherwise, the hit signal is negated, and the controller enters the memory service state. If the INV signal is high, then the valid bit of the cache line specified by the instruction address is reset if the tag of the address matches the tag of the line.

The memory service state is very similar to the disabled state. The only difference is that when the data is fetched from memory, it is also written to the appropriate core RAM line, the tag is written to the corresponding line of the tag RAM, and the valid bit of that line is asserted.

Cache Control Instructions

The only cache control instruction supported by the DIVA instruction set is the icli (instruction cache line invalidate) instruction. This instruction supplies an address using the register plus offset addressing mode. If the address is found in the cache, the corresponding cache line is invalidated.
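
The three controller states described above (disabled, normal, memory service) can be modelled behaviourally as follows. This is a rough sketch, not the hardware design: the class name, the dictionary standing in for memory, and the method names are all assumptions, and sub-states and timing are ignored.

```python
class ICacheModel:
    """Toy model: 128 lines, each holding eight 32-bit instructions."""
    LINES, WORDS = 128, 8

    def __init__(self, memory):
        self.memory = memory              # hypothetical: dict of byte address -> instruction
        self.enabled = False              # controller starts in the disabled state
        self.tags = [0] * self.LINES
        self.valid = [False] * self.LINES
        self.core = [[0] * self.WORDS for _ in range(self.LINES)]

    @staticmethod
    def _split(addr):
        # tag (upper 20 bits), line number (7 bits), instruction index (3 bits)
        return addr >> 12, (addr >> 5) & 0x7F, (addr >> 2) & 0x7

    def _fetch_line(self, addr):
        base = addr & ~0x1F               # 256-bit (32-byte) aligned block
        return [self.memory.get(base + 4 * i, 0) for i in range(self.WORDS)]

    def set_enable(self, on):
        self.enabled = on
        if not on:                        # entering the disabled state resets all valid bits
            self.valid = [False] * self.LINES

    def read(self, addr):
        """Return (instruction, hit)."""
        tag, line, word = self._split(addr)
        if not self.enabled:              # disabled state: always fetch from memory
            return self._fetch_line(addr)[word], False
        if self.valid[line] and self.tags[line] == tag:
            return self.core[line][word], True        # normal state, hit
        data = self._fetch_line(addr)     # memory service state: fetch and fill the line
        self.core[line], self.tags[line], self.valid[line] = data, tag, True
        return data[word], False

    def invalidate(self, addr):
        """INV: clear the valid bit only if the tag of the address matches."""
        tag, line, _ = self._split(addr)
        if self.valid[line] and self.tags[line] == tag:
            self.valid[line] = False
```

A miss in the normal state fills the line and returns the word in the same call, which collapses the memory service state into a single step.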




Chapter 8 - Exceptions

This chapter defines the exceptions and exception-handling mechanism for the DIVA PIM node. Exceptions, arising from execution of node instructions, and interrupts, from other sources such as an internal timer or external interrupt signal, are handled by a common mechanism. For the most part this document will refer to both exceptions and interrupts as exceptions.

Traditionally, RISC processors have had relatively primitive mechanisms for exception handling compared to CISC processors, which may have multiple stack registers, extensive hardware-supported vectoring, and priority-level controls for enabling exceptions. Even with these supporting hardware features, it is common to find problems of priority inversion and stack management errors in interrupt-service software. Errors in priority assignment are not easily fixed once cast in hardware. Exception handling hardware is difficult to implement and integrate with high-performance hardware.

The exception handling scheme for DIVA has a modest hardware requirement, exporting much of the complexity to software, which is easier to mend. It does provide an integrated mechanism for handling hardware and software exception sources. Additionally, it provides a flexible priority assignment scheme which minimizes the amount of time that exception recognition is disabled. While the hardware design supports traditional stack-based exception handlers, we also outline a non-recursive dispatching scheme which uses DIVA hardware features to allow preemption of lower-priority exception handlers using a mechanism which should be easier to debug.

Hardware-Vectored Exceptions

The DIVA node processor must respond to a variety of exceptions due to internal instruction processing conditions and interrupts due to external stimuli. The PIM node processor has only four hardware-vectored exceptions; all others are dispatched by software with some hardware assistance. The exceptions are listed in descending priority order.


TABLE 3. Hardware-Vectored Exceptions

Exception                           Vector Address   Notes
Hard RESET                          TBD              Power-on clear and/or diagnostics
Soft RESET                          0x08000000       External reset
Undefined Instruction (incl. BRK)   0x08000100
Software-vectored exceptions        0x08000200

The assignment of a vector address to the hard RESET exception depends on a specification of the initial program load bootstrap mechanism.

Note that the three vector addresses other than the hard RESET point to exception handler routines located at the start of node DRAM, so the node DRAM must be initialized and functional for any operation beyond hard RESET.

All exceptions other than reset and undefined-instruction exceptions are vectored by hardware to the catchall "software-vectored exception" handler, which examines the exception source word to perform a software-vectored dispatch to the appropriate exception handler.

Hardware Support for Hardware-Vectored Exceptions

The node processor has several privileged registers and a privileged instruction, RFE, used to return from exception handlers to normal processing.

All exceptions operate in supervisor mode. The program counter and processor status word are copied to privileged temporary registers before exception processing is begun. The exception handling code runs in the same address map as the preceding code. Other state changes are performed by the exception handler if necessary. Other registers are set by specific exception conditions, e.g., MADR is set in the event of a memory-access exception. The exception source word is set to indicate the cause of all but the reset and undefined-instruction exceptions, which are implicitly identified by the hardware vectoring to associated exception handlers. The exception source word and its associated enable mask register are discussed at more length in the "Software-Vectored Exceptions" section. The reset and undefined-instruction exceptions may not be disabled. All other exceptions may be disabled in aggregate by setting a bit in the PSW or selectively, by setting a bit in the exception-enable mask register.

TABLE 4. Hardware State at Start of Exception Processing

Register   Field   Value     Notes
PSW        MD      0         Mode is set to supervisor; other fields unchanged
PC                 handler   Address of exception handler
FADR               old PC    Address of faulting instruction or next instruction
SSW                old PSW   Saved copy of prior PSW

Upon completion of exception handling, the RFE instruction will copy the FADR to the PC and the SSW to the PSW to resume normal processing. Depending on the cause of the exception, the FADR may point to the instruction that caused the exception, if the exception prevented the instruction from completing, or to the next instruction in the code sequence, if the prior instruction did complete. For example, a memory access fault would load the FADR with the address of the load or store instruction which caused the access exception, while a timer interrupt or external interrupt would load the FADR with the address of the next instruction to be executed. The exception handling code is responsible for adjusting the FADR as needed prior to executing the RFE instruction. Depending on the nature of the exception, the faulting instruction may be retried, for example a WideWord instruction after a lazy register save, or a memory access instruction after an address-translation adjustment.

The node processor provides four scalar system scratch registers to be used by exception handlers. Exception handling code requiring more registers is responsible for saving and restoring node processor registers as needed.

Hardware-Vectored Exception Descriptions

Hard RESET (0xTBD)

This exception provides a starting point for power-on initialization and (optionally) self-test and diagnostic functions for the node. It can be triggered by internal power-on detection circuitry or an external source. At the conclusion of initialization and testing the node processor is ready for initial program loading by a mechanism TBD. This mechanism may be simplified for a node attached to a host via its system interface.

In contrast to the soft reset, which generally preserves present hardware state and register contents, a hard reset will set certain hardware to a known state to allow straightforward initialization.


TABLE 5. PSW State at Hard or Soft RESET

Bit   Field    Value   Notes
0     MD       0       Mode is set to supervisor
1              X       Reserved
2     IC       0       Instruction cache is disabled
3     EE       0       Exception recognition is disabled
4              0       WideWord instruction processing is disabled
5     FP       0       Floating-point instruction processing is disabled
6-7   Unused   X       Reserved
8     IA       0       Instruction address translation is disabled
9     DA       0       Data address translation is disabled
               X       Reserved
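
To make the entry and return sequence concrete, the sketch below models the hardware actions listed in Table 4 and the RFE behaviour, using the MD and EE bit positions from Table 5. It is a simplified illustration under several assumptions: bit 0 is taken to be the most significant bit of a 32-bit PSW, MD = 1 is taken to mean user mode, and the `Node` class and its method names are hypothetical.

```python
def psw_mask(bit):
    # Assumption: bit 0 is the most significant bit of a 32-bit PSW.
    return 1 << (31 - bit)

MD, EE = psw_mask(0), psw_mask(3)   # mode bit and exception-enable bit

class Node:
    def __init__(self):
        self.pc, self.psw = 0, 0    # reset state: supervisor mode, exceptions disabled
        self.fadr, self.ssw = 0, 0

    def take_exception(self, vector, resume_pc, software_vectored=False):
        """Hardware actions at the start of exception processing."""
        self.fadr, self.ssw = resume_pc, self.psw   # save old PC and old PSW
        self.psw &= ~MD                             # MD = 0 selects supervisor mode
        if software_vectored:
            self.psw &= ~EE                         # recognition is disabled on entry
        self.pc = vector

    def rfe(self):
        """RFE: copy FADR to the PC and SSW to the PSW."""
        self.pc, self.psw = self.fadr, self.ssw
```

A handler that wants to skip a faulting instruction rather than retry it would add 4 to `fadr` before calling `rfe()`.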

Soft RESET (0x08000000)

After a kernel or monitor program has been loaded into functional DRAM, the external RESET input causes instruction execution to begin at this DRAM address. It is anticipated that this RESET can be triggered either by an external input or by the host processor accessing the node via its system interface.

It is expected that a soft reset handler will dump a detailed snapshot of node status to memory to aid debugging before reinitializing the kernel or monitor data structures.

Undefined Instruction (0x08000100)

This vector services all undefined instruction exceptions and also serves as the primary exception handler for breakpoint instructions. Breakpoint instructions are implemented by a software convention defining one or more undefined instruction opcodes as BRK. The FADR register points to the address of the undefined instruction. To allow the BRK mechanism to debug exception handling code, we adopt the convention that SR3 is reserved exclusively for use by this exception handler, which does not use other scratch registers. This is not adequate to allow use of BRK prior to copying of FADR and SSW, however.

Software-vectored exceptions (0x08000200)

This vector provides the initial exception handling for all other exceptions and interrupts in the system. Recognition of this aggregate exception may be disabled by privileged code altering the PSW and is automatically disabled upon exception recognition, to remove any hardware requirement to support nested exceptions.

Software-Vectored Exceptions

Most exception sources in the DIVA PIM are serviced by a software-vectored exception handler. Determination of the exception cause requires examination of the 32-bit exception source word, which constantly monitors hardware which may cause exceptions and also provides the ability for software to trigger exceptions.

Nested exceptions can be supported if the exception handler saves essential state, notably FADR and SSW, prior to reenabling exceptions. The software-vectored exception handling procedure supports nesting of exceptions for some potentially lengthy handlers by splitting the exception handler into primary and secondary parts. Primary exception handlers are non-interruptible except for reset and undefined-instruction exceptions. Secondary exception handlers may be interrupted by other exceptions. They may or may not be re-entrantly interrupted by other instances of the same exception type, depending on the handler code's treatment of the mask register.

Lightweight Exceptions

Lightweight exceptions are those which can be serviced completely within the primary exception handler, and do not require saving of transient exception state. Hardware disables further exceptions until reenabled by execution of RFE.

An example of a lightweight exception is the timer tick exception, which increments a counter in memory. If the tick does not end a scheduling quantum, no further processing is required. If the tick does end a scheduling quantum, it triggers a quantum-expiration exception, but does no further processing itself.

Heavyweight Exceptions

Heavyweight exceptions are those which cannot be serviced entirely within a primary exception handler. The primary exception handler saves necessary exception state in one of three locations. Temporary use is made of the system scratch registers. Processor context is saved, as necessary, in a register save area in a fixed-location memory area common to all primary exception handlers. Information specific to the particular exception, which is required for later processing by the secondary exception handler, is saved in a fixed-location memory area specific to that particular exception type.

Primary Exception Handlers

Primary exception handlers perform all of the processing for lightweight exceptions and the initial time-critical portion of heavyweight exceptions.

The environment of primary exception handlers is highly constrained. They may use the system scratch registers SR0-SR3 freely but must save and restore any other GPRs. Primary handlers may call other routines conforming to the constraints, but must use the exception stack, which is located at the top of the kernel stack segment. Calling a subroutine in the primary exception handler environment requires initializing the stack pointer to the fixed top of the exception stack area. Primary handlers are written in assembly language.

Secondary Exception Handlers

Secondary exception handlers perform the non-initial processing of heavyweight exceptions. They may not use the system scratch registers SR0-SR3, since exceptions are enabled during most of the execution of the secondary handler. Secondary handlers may be written in a restricted subset of the C language. Secondary handlers are written in a stylized form providing functions to suspend and resume their processing if preempted by higher priority exceptions.

Hardware Support for Software-Vectored Exceptions

All software-vectored exception sources have an associated bit defined in the 32-bit exception source word, ESW, and corresponding bits in the exception-enable mask register, EMR, the exception set register, ESR, and the exception reset register, ERR. When a software-vectored exception is recognized, the global exception enable bit in the processor status word, PSW, is cleared, so that hardware events which cause changes to the ESW cannot trigger a nested exception. Reset and undefined instruction exceptions may preempt primary exception handling code, but other exceptions will not be recognized.

The exception source word is a 32-bit register recording exceptions initiated both by hardware and software sources. Hardware-source bits in the exception source word may be set to one by hardware conditions, such as a pbuf interrupt, while software-source fields are set by software writing a one to the corresponding bit location in the exception set register. Once set, a bit in the exception source word can be cleared only by writing a one to the corresponding bit of the exception reset register. Although labeled registers, both ESR and ERR are really register-address triggering functions. That is, a one written to any bit in either of these registers causes an immediate and one-time effect on the corresponding bit in the exception source word; ESR and ERR do not maintain any state.

TABLE 6. Exception-Related Registers

Name                                   PR#   Description
Exception Source Word (ESW)            8     Specifies sources of exceptions
Exception Enable Mask Register (EMR)   9     Bitwise exception enabling mask, 1 = enabled
Exception Set Register (ESR)           10    Write 1 to set corresponding bit in source word, SW-source fields only
Exception Reset Register (ERR)         11    Write 1 to clear corresponding bit in source word

Bits in ESW are affected by hardware conditions and by ESR and ERR actions regardless of the settings of the exception enable mask register, EMR. The bits of EMR merely enable, or disable, corresponding bits of ESW to cause exceptions. Therefore, there is a global exception enable control via the exception enable bit in PSW and individually maskable controls for each bit of the ESW via the EMR.
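
The interplay of the four registers can be illustrated with a small software model. This is a sketch under assumptions: the class and method names are invented for illustration, and the `sw_source_mask` parameter stands in for whatever hardware restriction limits ESR writes to software-source fields.

```python
MASK32 = 0xFFFFFFFF

class ExceptionRegs:
    def __init__(self):
        self.esw = 0          # exception source word
        self.emr = 0          # enable mask, 1 = enabled
        self.psw_ee = False   # global exception enable bit in the PSW

    def hw_event(self, bits):
        # Hardware conditions set ESW bits regardless of the EMR setting.
        self.esw |= bits & MASK32

    def write_esr(self, bits, sw_source_mask=MASK32):
        # Writing 1 sets the matching ESW bit; only software-source fields respond.
        self.esw |= bits & sw_source_mask & MASK32

    def write_err(self, bits):
        # Writing 1 clears the matching ESW bit; ERR itself keeps no state.
        self.esw &= ~bits & MASK32

    def exception_pending(self):
        # An exception is recognized only if globally enabled and individually unmasked.
        return self.psw_ee and (self.esw & self.emr) != 0
```

Note that a source masked off in EMR still latches in ESW, so it is recognized as soon as its mask bit is set.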

The Exception Source Word has 32 possible hardware- and software-initiated exception sources. The priority of the sources decreases with increasing bit number.


TABLE 7. Exception Source Word

Exception Name                     Initiator   Bit#   Description
Watchdog Timer                     HW          0      May not be implemented
Unmapped Instruction Access        HW          1      Instruction access not within segment boundaries
Invalid Instruction Access         HW          2      Instruction access not permitted
Unmapped Data Access               HW          3      Data access not within segment boundaries
Invalid Data Access                HW          4      Data access not permitted
PBuf Receive Interrupt             HW          5
PBuf Send Error                    HW          6
Interval Timer                     HW          7      Tick counter
WideWord Not Available             HW          8      WideWord instructions attempted without enable
Floating Point Not Available       HW          9      Floating-point instructions attempted without enable
Address Fault Fix-up               SW          10
Received Packet Processing         SW          11
Send Error Processing              SW          12
Reserved                           SW          13
Host Interrupt                     HW          14     May not be implemented
FP Divide by Zero                  HW          15     Refer to FPSR description
Host Interrupt Processing          SW          16
FP Unsupported Value               HW          17     Refer to FPSR description
Context Swapper                    SW          18
System Call                        HW          19
Privileged Instruction Violation   HW          20
Scalar Integer ALU Exception       HW          21
WideWord Integer ALU Exception     HW          22
FP Inexact/Invalid                 HW          23     Refer to FPSR description
Integer ALU Fix-up                 SW          24
WideWord ALU Fix-up                SW          25
Floating Point Fix-up              SW          26
Reserved                           SW          27
Lock Buzzer                        SW          28     May not be implemented
Thread Rescheduler                 SW          29
Thread Dispatcher                  SW          30
Return to User Mode                SW          31     Full register restore as necessary

Dispatch of Software-Vectored Exception Handlers

The DIVA PIM node exception handling mechanism requires little specialized hardware support and supports preemption of lengthy low-priority handlers without requiring LIFO processing due to stack mechanisms. Dispatch is always to the highest priority exception handler. There is no possibility of pathological stack growth under high rates of exceptions. System overload due to design problems will manifest as overruns, which can be evident and recoverable, rather than stack explosion, which is typically obscure and fatal.

Dispatch to the primary handler

A new exception condition will be recognized if exceptions are enabled in the PSW and if the particular source is enabled by the mask register. The hardware begins execution of code at the software-vectored exception vector address. Exceptions are disabled in the new PSW. Since the primary handlers are non-recursive and run to completion, processor state can be saved to a reserved temporary area at a fixed address (rather than a true stack) as needed by the particular handler.

The exception source word is copied into a scalar GPR and the ELO instruction is used to encode the bit number of the leftmost (smallest numbered) set bit. This operation selects the highest priority source. The encoded source bit number is used as the index into a vector of handler addresses, and the processor branches to that primary handler.

The selected primary handler determines whether the exception is lightweight enough to be handled in the primary handler or whether additional processing must be deferred to the secondary handler.
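
The priority selection step can be sketched as below. The `elo` function imitates what the text says the ELO instruction computes (the bit number of the leftmost set bit, with bit 0 as the most significant bit); the `dispatch` helper, the handler table, and the masking with EMR before encoding are assumptions made for illustration rather than details given in the text.

```python
def elo(word):
    """Bit number of the leftmost set bit of a 32-bit word, counting bit 0 as the MSB."""
    word &= 0xFFFFFFFF
    if word == 0:
        return None                      # no source pending
    return 32 - word.bit_length()

def esw_bit(n):
    """Mask for exception source bit n (bit 0 is the most significant bit)."""
    return 1 << (31 - n)

def dispatch(esw, emr, handlers):
    """Select the highest-priority enabled source and call its primary handler.

    `handlers` is a hypothetical table mapping bit numbers to callables;
    masking with `emr` is an assumption so that disabled sources are skipped.
    """
    source = elo(esw & emr)
    if source is None:
        return None
    return handlers[source]()
```

Because lower bit numbers sit further left, a single leftmost-one encode yields the priority order of Table 7 without any comparison loop.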

Completion of a primary handler

If the primary handler can complete the exception processing, it does so and then restores the saved GPRs and status before reenabling exception recognition by executing the RFE instruction. Prior to completion it will reset its associated exception source bit.

If the primary handler cannot complete the exception processing, it will copy the necessary state to a structure associated with its secondary handler, and set the bit associated with the secondary handler by writing to the exception set register. After restoring saved GPRs and status and resetting its source bit, it reenables exception recognition by executing the RFE instruction. The highest-priority exception source will subsequently be recognized and begin exception processing. This may be the secondary handler just scheduled or a higher priority hardware or software exception handler.

We can optimize restoring GPRs by delaying this until all secondary handlers have completed and no further exceptions are present. The deferred GPR restore can be accomplished by setting a software exception for the "return to user mode" handler. It remains to be seen whether this is generally advantageous for performance.
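
The hand-off from a primary to a secondary handler might look like the following simulation. It is a sketch only: pairing the PBuf Receive Interrupt (bit 5) with Received Packet Processing (bit 11) is an assumption based on the names in the exception source word table, and the `Kernel` class, its fields, and the dispatch loop are hypothetical.

```python
PBUF_RECEIVE, RECEIVED_PACKET = 5, 11    # bit numbers from the exception source word table

def bit(n):
    return 1 << (31 - n)                 # bit 0 taken as the most significant bit

class Kernel:
    def __init__(self):
        self.esw = 0
        self.packet_env = None           # structure associated with the secondary handler
        self.log = []

    def primary_pbuf_receive(self, packet):
        # Heavyweight case: copy state for the secondary handler, then schedule it.
        self.packet_env = packet
        self.esw |= bit(RECEIVED_PACKET)     # models writing 1 to the exception set register
        self.esw &= ~bit(PBUF_RECEIVE)       # models resetting the handler's own source bit
        self.log.append("primary")           # an RFE would follow here

    def secondary_received_packet(self):
        self.log.append(("secondary", self.packet_env))
        self.packet_env = None
        self.esw &= ~bit(RECEIVED_PACKET)

    def run(self, packet):
        self.esw |= bit(PBUF_RECEIVE)        # hardware raises the interrupt
        while self.esw:
            source = 32 - self.esw.bit_length()   # leftmost set bit has highest priority
            if source == PBUF_RECEIVE:
                self.primary_pbuf_receive(packet)
            elif source == RECEIVED_PACKET:
                self.secondary_received_packet()
            else:
                raise RuntimeError("no handler registered for bit %d" % source)
        return self.log
```

If a higher-priority hardware bit were set between the two steps, the loop would service it before the deferred secondary work, which is the ordering the text describes.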

Dispatch of a secondary handler

The initial part of the software vectoring of a secondary handler is the same as for a primary handler. After the handler branches to the specific secondary handler code, the secondary handler is required to perform more elaborate state saving due to the possibility of preemption by higher-priority sources. The first portion of the secondary handler runs with exceptions disabled.

When a secondary handler begins execution it installs a pointer to its environment structure in the privileged register SR2. If the prior value of SR2 is zero, it is not preempting another secondary handler. If the prior value is nonzero, it is preempting another lower-priority secondary handler. To preempt, the current handler saves the state of the prior secondary handler by calling its suspend routine, the address of which is at a fixed offset within the environment. The suspend routine copies the necessary state into the environment and returns. The environment will typically hold only one instance of a given type of suspended secondary handler. This means that while exceptions can interrupt and preempt secondary handlers of a different type, we don't support reentrant handling of multiple exceptions of the same type. While it is a straightforward extension to support a per-type stack or queue of multiple exception instances, in most circumstances the inability to complete exception processing prior to encountering a subsequent exception of the same type reflects an underlying system-design problem.

The handler is coded to record its essential state at periodic intervals. In effect, it stores a checkpoint record of its progress in its environment with sufficient detail to allow processing to resume in the event of a preemption. A technique sufficient to maintain atomicity is to "double buffer" a structure with essential information and "flip" between the consistent and working copies with a write to an index or pointer variable. Code progress can be recorded by using a state variable for a software state machine or by updating function pointers.
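
The double-buffer technique can be illustrated with a short example. This is a sketch of the general idea rather than the DIVA kernel's code: the `Checkpoint` class, the `process` routine, and the `preempt_after` parameter (which merely simulates a preemption point) are all hypothetical.

```python
class Checkpoint:
    """Two copies of the handler's essential state plus an index to the consistent one."""

    def __init__(self, initial):
        self.copies = [dict(initial), dict(initial)]
        self.current = 0                       # index of the consistent copy

    def consistent(self):
        return self.copies[self.current]

    def commit(self, **updates):
        working = 1 - self.current
        # Build the new state in the working copy; the consistent copy is untouched.
        self.copies[working] = dict(self.copies[self.current], **updates)
        self.current = working                 # one write "flips" the buffers

def process(items, ckpt, preempt_after=None):
    """Sum items with a checkpoint after each one; stop early to mimic a preemption."""
    done = 0
    while ckpt.consistent()["next"] < len(items):
        state = ckpt.consistent()
        if preempt_after is not None and done == preempt_after:
            return False                       # preempted; progress survives in the checkpoint
        ckpt.commit(next=state["next"] + 1, total=state["total"] + items[state["next"]])
        done += 1
    return True
```

Calling `process` again with the same `Checkpoint` resumes from the last committed step, since the only state it relies on is the consistent copy.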

In contrast to a traditional stack-based system, which keeps activation records on a stack which must be unwound in LIFO order, our dispatch scheme records the activation of the handler by a bit in the exception source vector, while storing the associated saved state of preempted handlers in handler-specific environment structures. This ensures completion of handlers in priority order without requiring hardware support of multiple priority levels for exception recognition. It may also reduce the amount of saved state. The handler itself can be coded to record the bare minimum of state to allow a resumption, rather than being forced to assume the worst case and save entire register sets which may or may not have been altered. This is particularly significant for the large register sets of the DIVA PIM node.

The "checkpointed" exceptions scheme is much easier to debug via an interactive debugger or memory dump, since the state of each active exception handler is recorded at fixed locations in a form which may be conveniently examined as a high-level structure. This is in contrast to a preemptive stack-based record, where the states of several handlers may be distributed in their lowest-level bindings across large chunks of stack at highly variable locations.

Completion of a secondary handler

The secondary handler completes by reinitializing its checkpoint record to its starting state, resetting its associated exception source word bit, and executing an RFE.

If no other exception is recognized, the lowest-priority software exception will restore all disturbed register states and return to user mode code.

Software-Vectored Exception Descriptions

For each exception, the relevant hardware status registers, and in particular the correct interpretation of FADR, still need to be detailed. This will require some iteration to converge on a good set of hardware exceptions.

The description of software exceptions is less helpful, since the details of the runtime kernel will define both the priority and meaning of software-initiated secondary exceptions.

Chapter 9 - Parcel Buffer


Introduction 


Parcel  I'ormat 


The  communication  abstraction  in  DIVA  is  a  ptirccl  (i’Aiia/k’l  ('<>mpiiiinf>  El.enwiUt.  A  parcel  is  closely  related  to  an  acti\e  message  as  it  is 
a  relatively  lightweight  communication  mechanism  containing  a  reference  to  a  function  to  be  invoked  when  the  parcel  is  received.  Parcels 
are  distinguished  from  acti\e  messages  in  that  the  destination  of  a  parcel  is  an  ohjcci  in  incnmn  ,  not  a  specific  processor.  From  a  program¬ 
mer's  Slew,  parcels,  together  with  the  global  address  space  supported  m  DIVA,  proside  a  compromise  between  the  ease  of  programming  a 
shared-memory  system  and  the  architectural  simplicity  of  pure  message  passing.  Remote  operations  or  accesses  can  be  accomplished 
through  parcel  sends  and  receiv  es,  application  programs  need  si'iecifv  only  the  address  of  an  object,  and  not  the  processor  upon  which  the 
object  resides. 

The  basic  mechanism  used  in  the  DIVA  system  to  support  parcel  sending  receiving  from  to  an  application  is  a  parcel  buffer  (or phiif).  The 
pbuf  has  a  virtual  as  well  as  a  physical  abstraction.  To  the  application,  the  pbuf  locations  appear  as  regular  memorv  locations  that  are  mani|v 
ulated  through  simple  loads  and  stores.  .'\t  a  phvsical  level,  the  pbuf  is  a  set  of  memorv -mapped  registers.  Fach  PIM  node  contains  a  pbuf 
that  serves  as  a  port  between  the  on-chip  parcel  interconnect  and  the  node.  (  The  on-chip  parcel  interconnect  connects  node  pbufs,  the  host 
interface  pbuf,  and  the  PIM  Routing  Component,  or  PiRC.)  Although  the  parcel  buffer  could  be  implemented  as  registers  within  the  PIM 
nrxle  processor,  a  memory-mapped  mechanism  for  the  parcel  bulfer  allow  s  a  uniform  implementation  for  the  node's  pbuf  as  w  ell  as  a  host 
pbuf  I  lence,  a  pbuf  w  ithin  the  PIM  chip  host  interface  is  memorv -map|x:d  into  the  host  prtKessor's  address  space  to  allow  the  host  processor 
to  communicate  with  PIM  nodes  via  the  parcel  mechanism 

To launch parcels, a user simply writes to appropriate fields in the pbuf. To receive parcels, users may either use an interrupt or polling
methodology to know when to read parcels from the pbuf. Parcel buffer access is managed by the system in the same fashion as any other region
of memory in the node's local address space (refer to Chapter 10).

The physical parcel format is shown in Figure 21. A parcel consists of a 96-bit header and 256-bit payload. Most of the parcel contents are
written by the user during a parcel launch; however, the system is responsible for generating the route, source, eid, and int fields. The route
field is a 16-bit value that is used by the PiRC to direct a parcel to the correct PIM chip and node. The source field is a 16-bit value that
represents the node ID of the sender and can be used by kernel software to correct routing errors. The eid field is the 16-bit environment
identifier of the process that launched the parcel, and the 8-bit int field indicates whether the parcel should generate an interrupt at the
receiving pbuf. The object field is a 32-bit virtual address of the object to which the parcel is directed, and the cmd is an 8-bit identifier
that the user can use to index into a table of commands for the specified object. The 256-bit payload consists of arguments for the command
task or other data associated with the action specified by the parcel.

[Figure 21: physical parcel format]
header:   route (16-bit) | source (16-bit) | eid (16-bit) | int (8-bit) | cmd (8-bit) | object (32-bit)
payload:  arguments (256-bit)


Parcel Buffer Addressing

Data is written to or read from the pbuf in 256-bit increments via the WideWord Unit registers. The pbuf address space can then be viewed
as a set of 256-bit registers. Besides the header and payload registers, there are also status and configuration registers. Although the payload
is the only true physical 256-bit register, each register is allocated 256 bits of the address space and is aligned to the least significant bit
boundary. For example, the 96-bit header is aligned to the least significant 96 bits of the 256-bit register space it is allocated. At least two
register sets are needed: one for sending and one for receiving. In addition, it is desirable to have multiple address mappings (aliases) of these
sets to support different access privileges and modes, as described later. To minimize interface issues with standard CPU cache line sizes in
supporting this feature, 256 bytes of address space are allocated for each virtual copy of a pbuf register set, with the 256 bytes distributed to
eight 256-bit registers. The 256-byte register set space for the pbuf send side is shown in Table 8. The payload and header are as described
in Figure 21. The status bits and route cache entry invalidate are described in later subsections.

TABLE 8. Parcel Buffer Send Register Set

Address Relative to    Register Description      Physical Size (bits)    Access Privilege
Register Set Base
0x00                   payload                   256                     supervisor/user
0x20                   header                    96                      supervisor/limited user
0x40                   status                    3                       supervisor/limited user
0x60                   reserved                  NA                      NA
0x80                   source                    16                      supervisor
0xA0                   eid                       16                      supervisor
0xC0                   route cache entry         96                      supervisor
0xE0                   route cache invalidate    NA                      supervisor

The source and eid registers are intended to be accessed only by the trusted supervisor kernel. Such access protection is accomplished
through address aliases for the pbuf, as described in the following paragraphs. At PIM node boot time, the kernel writes a node ID to the
source register. This value is copied into the source field of every outgoing user parcel when launched. When any new application is
swapped in, the kernel should also write the application's eid value into the eid register in the pbuf send register set. The value of the eid
register is copied into the eid field of every outgoing user parcel when launched.
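
The header layout of Figure 21 and the send register offsets of Table 8 can be sketched in a few lines of Python. This is an illustrative model only: the helper names (`pack_header`, `unpack_header`, `SEND_REGS`) are hypothetical, and the sketch assumes the fields are packed in the order listed in Figure 21 with route in the most significant position, which the text does not state explicitly.

```python
# Field names and widths follow Figure 21. The packing order (route in the
# most significant bits) is an assumption made for this sketch.
HEADER_FIELDS = [
    ("route", 16), ("source", 16), ("eid", 16),
    ("int", 8), ("cmd", 8), ("object", 32),
]

# Byte offsets relative to the send register set base, from Table 8.
SEND_REGS = {
    "payload": 0x00, "header": 0x20, "status": 0x40, "reserved": 0x60,
    "source": 0x80, "eid": 0xA0,
    "route_cache_entry": 0xC0, "route_cache_invalidate": 0xE0,
}

def pack_header(**fields):
    """Pack the six header fields into a single 96-bit integer."""
    value = 0
    for name, width in HEADER_FIELDS:
        field = fields.get(name, 0)
        if not 0 <= field < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        value = (value << width) | field
    return value

def unpack_header(value):
    """Split a 96-bit header integer back into its named fields."""
    out = {}
    for name, width in reversed(HEADER_FIELDS):
        out[name] = value & ((1 << width) - 1)
        value >>= width
    return out
```

The eight registers sit 0x20 bytes (256 bits) apart, which is how eight 256-bit register slots fill the 256-byte register set.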

The corresponding register space for the pbuf receive side is shown in Table 9.

TABLE 9. Parcel Buffer Receive Register Set

Address Relative to    Register Description      Physical Size (bits)    Access Privilege
Register Set Base
0x00                   payload                   256                     supervisor/user
0x20                   header                    96                      supervisor/user
0x40                   status                    5                       supervisor/limited user
0x60                   reserved                  NA                      NA
0x80                   reserved                  NA                      NA
0xA0                   reserved                  NA                      NA
0xC0                   reserved                  NA                      NA
0xE0                   reserved                  NA                      NA

As mentioned previously, the 256-byte pbuf send and receive register sets are multiply mapped to support a number of desired features. First,
two aliases of the send register set are desired to support different functions: one address space for non-triggering writes, one address space
for triggering writes. The distinction is that writes to the non-triggering address space simply enter new data into the send register set but do
not cause a parcel launch. In contrast, a write to a register within the triggering address space not only causes new data to be written into the
specified register but also initiates a parcel launch, which results in the parcel contents of the pbuf being forwarded to the PiRC. The
provision of triggering and non-triggering spaces supports several nice capabilities but is also necessary for restoration of the pbuf state upon
context switches. With this support, it is not necessary to write both the header and payload to launch a parcel. For instance, if a multicasting
operation is desired, it is only necessary to write the payload once to the non-triggering address and then trigger a parcel to each destination
of the multicast by writing the appropriate header to the triggering address for each destination object. Similarly, if it is desired to send
multiple parcel payloads to the same object, only the payload need be written to the triggering address for each send once the header has been
initialized.
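
The multicast and repeated-send patterns described above can be illustrated with a toy software model. This is a sketch under stated assumptions, not the hardware interface: `SendPbufModel`, `multicast`, and `stream` are hypothetical names, and the model ignores the status bits, so it assumes each parcel leaves the pbuf before the next triggering write.

```python
class SendPbufModel:
    """Toy model of the pbuf send register set (hypothetical, status bits ignored)."""

    def __init__(self):
        self.header = 0
        self.payload = 0
        self.launched = []  # parcels forwarded to the PiRC as (header, payload)

    def write(self, register, value, triggering):
        # Every write updates the addressed register.
        if register == "header":
            self.header = value
        elif register == "payload":
            self.payload = value
        else:
            raise ValueError(f"unknown register: {register}")
        # A write through the triggering alias also launches the current parcel.
        if triggering:
            self.launched.append((self.header, self.payload))

def multicast(pbuf, headers, payload):
    """Write the payload once, then trigger one parcel per destination header."""
    pbuf.write("payload", payload, triggering=False)
    for header in headers:
        pbuf.write("header", header, triggering=True)

def stream(pbuf, header, payloads):
    """Initialize the header once, then trigger one parcel per payload."""
    pbuf.write("header", header, triggering=False)
    for payload in payloads:
        pbuf.write("payload", payload, triggering=True)
```

In both patterns the number of register writes is one more than the number of parcels launched, rather than two writes per parcel.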

Similarly, there are two aliases of the receive register set: one space for non-destructive reads, one space for destructive reads. The
non-destructive read merely reads from the pbuf location but does not cause the data to be removed from the pbuf. The destructive read also
returns the specified payload or header register contents, but it also causes the status of the receive register set to be marked empty so that any
parcel waiting at the PiRC may then be forwarded to the pbuf. In a sense, the parcel that is read from the destructive read address is then
removed from the pbuf.

Other desired capabilities are accomplished through even more aliases of the pbuf hardware. For instance, it is
useful for a process that is launching a parcel to be able to specify whether that parcel should generate an interrupt when it arrives at its
destination node. By using another set of aliases for this function, one address bit can be decoded to determine if the parcel should generate an
interrupt once it arrives at a destination pbuf. The system should set up the address translation unit appropriately to grant or revoke such
privileges from users. Any write to a sending header register address, whether it is triggering or non-triggering, updates the int field of the current
pbuf header. If the system or user writes to the pbuf header field at the interrupting address, the int field of the parcel gets set to indicate the
parcel is to generate an interrupt at the receiving pbuf; otherwise, the int field indicates no interrupt.

Additionally, it is desired that the supervisor be able to explicitly write the bits that are to be contained in an outgoing parcel. A user has no
control over the route, source, eid, and int fields; these fields are generated by mechanisms set up by the supervisor kernel. However, the
supervisor should be able to circumvent such mechanisms to write to these fields directly. To implement such a capability, another set of
aliases is used. It is assumed that kernel software sets up the address translation tables appropriately so that only the supervisor may access
the pbuf hardware at the supervisor addresses. In addition to the ability of the supervisor to write bits explicitly into an outgoing parcel, there
are parts of the pbuf hardware that are intended to be accessible only to a trusted supervisor, as indicated in Table 8 and Table 9. For example,
some of the status bits should only change state when accessed by the supervisor kernel (refer to the subsection on status bits). The
supervisor aliases of the pbuf hardware accomplish this task as well.

Combining support for triggering/non-triggering writes, destructive/non-destructive reads, interrupt specification, and supervisor/user
capability, 6 aliases are used for each of the pbuf send and receive register sets, as shown in Table 10. The total address space required to support
this pbuf functionality is 3 Kbytes.

TABLE 10. Address Mapping of Pbuf Aliases

Alias Address Relative    Register Set    Type of Write (send)    Parcel Type             Access Level
to Pbuf Base              Type            or Read (receive)
0x000                     send            non-triggering          interrupting            user
0x100                     send            triggering              interrupting            user
0x200                     receive         non-destructive         NA                      user
0x300                     receive         destructive             NA                      user
0x400                     send            non-triggering          non-interrupting        user
0x500                     send            triggering              non-interrupting        user
0x600                     receive         non-destructive         NA                      user
0x700                     receive         destructive             NA                      user
0x800                     send            non-triggering          explicitly specified    supervisor
0x900                     send            triggering              explicitly specified    supervisor
0xA00                     receive         non-destructive         NA                      supervisor
0xB00                     receive         destructive             NA                      supervisor

Recall that there need be only one set of parcel buffer hardware for send and receive functions; the
multiple address mappings exist only to use address bits to impart information to the pbuf control hardware. By using address bits to control such
features, access privileges to the pbuf hardware can be granted or revoked by the supervisor kernel by normal management of the address
translation unit. Many of the register aliases are not needed to support some of this functionality; for example, a copy of the receive register
set for the interrupting/non-interrupting launching capability is unnecessary. However, to accommodate protection through the segmented
memory management scheme, it is necessary for the alias sets to be arranged as shown in Table 10.
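
The way address bits carry control information in Table 10 can be made concrete with a small decoder. This is a hedged sketch: the function name `decode_alias` and the returned dictionary keys are hypothetical, and the bit assignments are simply read off the table (0x100 selects triggering or destructive access, 0x200 selects the receive set, and the 0xC00 bits select the user interrupting, user non-interrupting, or supervisor group).

```python
REGISTER_SET_BYTES = 0x100  # each alias is one 256-byte register set
NUM_ALIASES = 12            # 12 x 256 bytes = 3 Kbytes, as stated in the text

def decode_alias(offset):
    """Decode a byte offset from the pbuf base according to Table 10."""
    if not 0 <= offset < NUM_ALIASES * REGISTER_SET_BYTES:
        raise ValueError("offset outside the 3-Kbyte pbuf alias space")
    alias = offset // REGISTER_SET_BYTES   # 0..11
    group = alias // 4                     # 0, 1: user; 2: supervisor
    is_receive = bool(alias & 2)
    active = bool(alias & 1)               # triggering write or destructive read
    if is_receive:
        mode = "destructive" if active else "non-destructive"
        parcel_type = "NA"
    else:
        mode = "triggering" if active else "non-triggering"
        parcel_type = ("interrupting", "non-interrupting",
                       "explicitly specified")[group]
    return {
        "set": "receive" if is_receive else "send",
        "mode": mode,
        "parcel_type": parcel_type,
        "access": "supervisor" if group == 2 else "user",
        "register_offset": offset % REGISTER_SET_BYTES,
    }
```

For example, offset 0x520 decodes to the header register (0x20) of the user, non-interrupting, triggering send alias.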

Parcel Buffer Status Bits

The pbuf send and receive hardware maintain a handful of bits to indicate the state of the pbuf. The three status bits associated with the send
side are shown in Figure 22. (Note this 3-bit status register is aligned to bit positions 253 - 255 of the status register address space within the
send register set.) The buffer empty bit indicates when it is possible for a parcel to be written into the pbuf. When this bit is set, the buffer is
empty and a new parcel may be written to it. The bit is reset, indicating the buffer is full, when an application (user or supervisor) writes to a
triggering address to launch a parcel. It is then once again set when the parcel transits out of the pbuf to the PiRC or host pbuf. This status bit
can be used to support a "safe mode" for sending parcels. Before writing a new parcel to the pbuf, an application can check this bit to ensure
the pbuf is available, i.e., the last parcel written to it has exited.

The overrun bit is used to indicate that an application has attempted to write a new parcel to the pbuf, although the pbuf is not empty. This
bit is a sticky error bit. It gets set when an overrun occurs and is reset only when the status bits are read. Since it is combined with other status
bits, it is important that anytime the send status is read, the state of the overrun bit should be checked if applications are not using the safe
mode to send parcels. User writes to the pbuf send registers are ignored if the overrun bit is set. The application must clear the overrun error
to resume launching parcels. This bit may also be set by a write to any send status register address contained in a supervisor mapping of the
pbuf. This capability is needed to restore the state of the pbuf upon context switch.


The route error bit indicates that the system does not have sufficient information to translate an object address to a route. When an
application writes to the triggering address of a send register to launch a parcel, a certain amount of system processing is applied to the parcel before
it is forwarded to the PiRC or host pbuf. One of the tasks is the generation of a route from the object address. More information about this
translation is given in the last section of this chapter. If the pbuf hardware cannot automatically generate this route, the route error bit is set,
causing an exception. This bit is reset when the supervisor reads the status register from a designated supervisor alias for the pbuf. When a
route error occurs, the buffer empty bit is re-asserted, even though the parcel has not exited the pbuf. This allows the supervisor to read the
contents of the pbuf, construct the proper route if possible, and then re-launch the parcel on behalf of the user without incurring an overrun
error.

[Figure 22: pbuf send status bits: buffer empty, overrun, route error]
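
The interaction of the three send status bits can be summarized as a small state model. This is a simplified sketch, not a description of the actual control logic: the class and method names are hypothetical, only triggering writes are modeled, and the supervisor's ability to write the overrun bit directly is omitted.

```python
class SendStatus:
    """Toy model of the send-side status bits (buffer empty, overrun, route error)."""

    def __init__(self):
        self.buffer_empty = True
        self.overrun = False
        self.route_error = False

    def user_trigger_write(self, route_known=True):
        """Model a user write to a triggering send address."""
        if self.overrun:
            return                    # user writes are ignored while overrun is set
        if not self.buffer_empty:
            self.overrun = True       # sticky until the status is read
            return
        self.buffer_empty = False     # parcel now occupies the pbuf
        if not route_known:
            self.route_error = True   # raises an exception in the real system
            self.buffer_empty = True  # re-asserted so the supervisor can re-launch

    def parcel_exits(self):
        """The parcel has transited to the PiRC or host pbuf."""
        self.buffer_empty = True

    def read_status(self, supervisor=False):
        """Return (buffer_empty, overrun, route_error) and apply read side effects."""
        snapshot = (self.buffer_empty, self.overrun, self.route_error)
        self.overrun = False          # reset when the status bits are read
        if supervisor:
            self.route_error = False  # reset only by a supervisor-alias read
        return snapshot
```

An application using the "safe mode" would call `read_status()` and launch only when the first element is true, so the overrun path is never taken.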

The five status bits associated with the pbuf receive register set are shown in Figure 23. (Note this 5-bit status register is aligned to bit
positions 251 - 255 of the status register address space within the receive register set.) The buffer full bit indicates that a parcel has been loaded
into the pbuf register set from the PiRC or host pbuf and is available for reading. This bit is set when the PiRC or host pbuf forwards data to
the node pbuf and is reset when an application performs a destructive read by reading from the appropriate pbuf alias. When the bit is reset,
it serves as a signal to the PiRC or host pbuf that the next parcel destined for this pbuf may be forwarded. If an application performs a read
(destructive or non-destructive) when the buffer is empty, all 0s are returned and the underrun status bit is set. Similar to the send overrun bit,
the receive underrun bit is also sticky and remains set until the user reads the status register.

[Figure 23: pbuf receive status bits. Bit 0: buffer full; bit 1: underrun; bit 2: interrupt; bit 3: blocking; bit 4: eid mismatch]

The interrupt, blocking, and eid mismatch bits all generate exceptions when set and remain set until the supervisor reads the status register
from a designated supervisor pbuf alias. The interrupt bit is set when a parcel that has its interrupt field set arrives at the receive registers.
The blocking bit is set when the PiRC or host pbuf is attempting to forward a parcel to the receive registers but cannot do so because the
receive registers still contain the previous parcel for an extended period of time. A buried 10-bit timeout counter is associated with this
function. The counter starts incrementing anytime the pbuf receive buffer is full and a request for parcel forwarding from the PiRC or host pbuf
occurs. If the counter reaches its maximum value, the blocking bit is set. If the parcel is destructively read before the counter expires, then the
counter is stopped and cleared, and no blocking is recorded. The eid mismatch bit is set when the eid field of the parcel in the receive
registers does not match the pbuf eid register contents when a user read is attempted on any of the pbuf receive registers, including the status
registers. In such a case, the buffer full status bit should also be masked when the status is delivered to the reading process. The concept is
that a user application should view the pbuf receive registers as empty if the parcel contained in them is for a different application, as
indicated by the eid values. However, the supervisor should always be able to read the pbuf contents without encountering any eid mismatch
error, regardless of the eid values in the parcel and pbuf configuration. Therefore, eid checking does not apply to supervisor reads from the
pbuf.

Parcel Buffer Exceptions

As described in the previous section, there are a number of pbuf events that may cause exceptions. The pbuf receive events that cause an
exception are indicated by the interrupt, blocking, and eid mismatch status bits. These bits are simply ORed together to set bit 5 of the
Exception Source Word (ESW) of the DIVA processor (refer to Chapter 8). When the kernel is invoked by this exception, the kernel must first read
the pbuf receive status register to determine which type of event caused the exception and then take appropriate measures to respond to the
exception. Similarly, a pbuf send event that causes an exception is indicated by the route error bit. When set, this status bit also causes bit 6
of the ESW to be set. Since this is the only send event that may cause an error, the kernel does not need to perform any extra decoding;
however, it will still need to read the send status register to clear the route error bit.

Object Address to Route Translation

The architecture allows for hardware support to facilitate the generation of routes from object addresses. The most flexible mechanism for
supporting this capability is a route cache, which simply contains mappings from objects to routes. The supervisor kernel manages entering,
placement, replacement, and invalidation of all entries explicitly. The form of a route cache entry is shown in Figure 24. (Note that the
96-bit entry is aligned to bits 160 - 255 of the route cache register space.) The supervisor makes such an entry into the route cache by simply
writing the required data to the route cache entry register space of the pbuf. Since this space is designated for supervisor access only, the
supervisor pbuf alias address must be used to accomplish a successful entry. The 16-bit index specifies which location of the route cache to
store the entry; in this manner, the supervisor explicitly manages placement and replacement activity. Although 16 bits are shown for the
index, a smaller number of bits may actually be used in specific implementations of this architecture. For instance, if the route cache allows
for only 16 entries, only the 4 least significant bits of the index field are used.

[Figure 24: route cache entry format; four fields of 16, 16, 32, and 32 bits]

For implementations which support a route cache, when a parcel is launched, the cache is searched for the object address specified in the
parcel. The mask field of a valid route cache entry indicates which bits of the corresponding object address should be compared to the parcel
object address to determine a successful match. An equation specifying a match M, where parob is the parcel object address and rcob is the
route cache object address, is given by

$$M = (mask_0 \wedge (rcob_0 \oplus parob_0)) \vee (mask_1 \wedge (rcob_1 \oplus parob_1)) \vee \cdots \vee (mask_{31} \wedge (rcob_{31} \oplus parob_{31}))$$

If a match is found, the corresponding route is written into the route field of the parcel. The hardware does not protect against matches on
multiple entries in the route cache, i.e., system software must set up the route cache appropriately so that only one entry will match any given
object address. The route cache behavior is undefined if multiple entries match. The match does not actually have to include the full 32-bit
address. Since the smallest allowable segment from the perspective of a DIVA PIM processor is 256 bytes, the match can be performed using
the most significant 24 bits of the parcel object address and route cache object address and mask.
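
The masked comparison can be sketched in software as follows. This is an interpretation, not the documented hardware behavior: read literally, the OR of masked XOR terms is 1 when some compared bit differs, so the sketch treats an entry as matching when that expression is zero. The function names and the dictionary layout of a cache entry are hypothetical.

```python
def entry_matches(mask, rcob, parob):
    """True when every bit selected by mask agrees between rcob and parob."""
    differing = (mask & (rcob ^ parob)) & 0xFFFFFFFF
    return differing == 0

def lookup_route(cache, parob):
    """Return the route of the first valid matching entry, or None.

    `cache` is a list of dicts with keys 'valid', 'mask', 'object', 'route'.
    None models the case where no translation exists and the supervisor
    must supply the route. Software is responsible for ensuring that at
    most one entry matches, as the text requires.
    """
    for entry in cache:
        if entry["valid"] and entry_matches(entry["mask"], entry["object"], parob):
            return entry["route"]
    return None
```

With a mask of 0xFFFFFF00, only the most significant 24 bits are compared, so every address inside one 256-byte segment maps to the same route.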

The route cache contains buried valid state bits, one for each entry. These valid bits are negated upon reset and any time a route cache
invalidate is executed. A valid bit for a particular entry is set when the index corresponding to that entry is written to. For the supervisor to be able
to manage the route buffer between multitasking user processes, supervisor-controlled invalidation is provided via an address-mapped
mechanism, similar to other pbuf functions. Any time the supervisor writes to the route cache invalidation address (offset 0xE0), the entire route
cache is invalidated. For this mechanism, the data contained in the write is irrelevant since it is not used in any meaningful manner.

The contents of the route cache may be read for debugging purposes. An internal address counter is maintained to provide this capability.
Upon reset, this counter points to index 0 of the route cache. Upon each read from the route cache entry address (offset 0xC0), data
corresponding to the indexed entry indicated by the current contents of the counter are returned, and the counter increments to point to the next
data. The counter value representing the index and the valid bit status are also returned for the entry. The format of the 97-bit data returned
upon such a read is shown in Figure 25. (Note that the 97-bit data is aligned to bits 159 - 255 of the node data bus.)

[Figure 25: Data Format for Route Cache Read. Fields: v | index (16-bit) | route (16-bit) | mask (32-bit) | object address (32-bit)]


In lieu of any such hardware or if the cache does not contain a translation for the object of a parcel to be launched, a route exception occurs
and the kernel must explicitly set up the route field of the parcel by using the supervisor alias addresses for the pbuf. As part of the exception
handling, the kernel may wish to make an entry into the route cache for the object address segment which caused the exception to prevent
further exceptions due to that segment. If a specific implementation of the pbuf architecture does not contain hardware support for route
generation, parcel launching will always invoke the supervisor kernel to generate the route.

Chapter 10 - Address Translation


Introduction 


Parcels, application code, and data contain virtual addresses. To interpret these addresses, a PIM processor must support a translation
mechanism. However, the overhead of maintaining conventional page tables at each node is prohibitive. To simplify translation, we classify DIVA
memory according to usage:

•  global memory is composed of contiguous segments distributed across nodes, visible to applications running on the host and PIM
nodes.

•  dumb memory is a region of a node's memory allocated as conventional pages in a host application's virtual space and untouched
by PIM node processing.

•  local memory is a region of a node's memory used exclusively by node routines. This rule is excepted during initialization when
the host system boot process loads node software.

A node must be able to rapidly determine if an address is located in its own memory, and if so, find the physical address. To condense
translation information, we use segments, each of which is defined by segment registers containing a base address and size. The local memory
region is partitioned into eight segments in the initial DIVA architecture, although this number could change in future DIVA architectures.
Like pages in a conventional system, the segment descriptors are generic in nature. It is only through system programming that the segments
serve a specific purpose. For example, a logical allocation of the eight segments would be to assign one segment for each of the following:

1.  Kernel code

2.  Kernel data

3.  Kernel stack

4.  Kernel parcel buffer

5.  User code

6.  User data

7.  User stack

8.  User parcel buffer

Remote addresses are translated via the concept of a home node, which is guaranteed to have the translation. In addition to the local
segments, a node maintains translation information for its resident portion of the global memory, as well as for any remote data for which it is
the home node. The major advantages of this approach are that translation may be accomplished rapidly, and translation information on each
PIM scales well.

The primary functions of the node address translation unit are to translate virtual addresses to physical addresses for those accesses which are
locally resident and to provide access protection. The types of accesses generated by a DIVA PIM processor that require translation include
instruction fetches and data accesses to memory or memory-mapped devices such as parcel buffers, generated by load or store instructions.

Given the simplicity of the address translation scheme discussed above, very little hardware support is needed to effect efficient translation.
A segment base address register and limit register is needed for each of the eight local segments. Also, one virtual base, limit, and physical
base register are needed for each resident global segment. The initial DIVA architecture provides four sets of global segment registers,
although alternative architectures could provide more. The address translation unit contains no direct support for home node translation,
although the preferred system programming is such that the global segments resident on a node form the portion of global memory for which
that node is the home node. If this is not the case, address faults invoke system software which performs the home node translation.

Address Translation Mechanisms

The DIVA PIM processor provides 4 Gbytes of virtual address space accessible to kernel and user applications via segments that are a power
of 2 in size. Segment sizes can range from 256 bytes to the maximum amount of physical memory available to a node. The initial DIVA
architecture supports a maximum segment size of 16 MBytes. Each virtual address generated by the PIM processor is 32 bits wide, and the
resulting physical address generated by the address translation unit is also 32 bits wide, although implementations may reduce this width to
optimize for the actual amount of physical memory present.

The PIM processor address translation unit supports three main types of address translation:

•  direct  address  translation 

•  local  address  translation 

•  global  address  translation 


virtual  address  (va) 

0  4  5  31 


I'imirc  26:  .\tlilrcss  I'ranslation  Types 


Figure 26 shows the three main address translation mechanisms provided. When the address translation unit is disabled, direct address
translation occurs, and the address translation unit will not generate any exceptions. In this case, the resulting physical address is identical
to the virtual address. If address translation is enabled, then the scope field of the virtual address must be inspected to determine what type
of translation should be used. In the initial DIVA architecture, the scope field is the most significant five bits of the virtual address VA. If
this 5-bit value is zero, then local translation is used. If the scope field equals binary value 00001, i.e., the virtual address falls in the range
of 0x08000000 to 0x0FFFFFFF, direct translation is used to generate the physical address; however, unlike the mode where address
translation is disabled, an exception can be generated in this case if access privileges are violated. By definition, the address region
0x08000000 to 0x0FFFFFFF is a supervisor-level region. Therefore, any user-level attempt to access this region while address translation
is enabled will trigger an exception. Lastly, if any of the four most significant bits of the virtual address are non-zero, i.e., va[0:3] != 0,
then global translation is used.
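The selection rule above can be sketched in a few lines of Python. This is only an illustration of the text, not the hardware logic: the function name and its return strings are hypothetical, and bit 0 is taken to be the most significant bit of the 32-bit address, as in Figure 26.

```python
def translation_type(va: int, translation_enabled: bool = True) -> str:
    """Return 'direct', 'local', or 'global' for a 32-bit virtual address.

    Hypothetical helper: models only the scope-field decision, not the
    privilege checks or the exceptions described in the text.
    """
    if not 0 <= va <= 0xFFFFFFFF:
        raise ValueError("virtual address must fit in 32 bits")
    if not translation_enabled:
        return "direct"      # physical address is identical to the virtual address
    scope = va >> 27         # five most significant bits, va[0:4]
    if scope == 0:
        return "local"
    if scope == 1:
        return "direct"      # supervisor-level region 0x08000000-0x0FFFFFFF
    return "global"          # at least one bit of va[0:3] is non-zero
```

Every scope value other than 00000 and 00001 has a non-zero bit in va[0:3], so the three cases cover the whole address space.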

Figure 27 shows the steps involved in local address translation. The 3-bit index field of the virtual address is used to select a set of local
segment registers for the translation. The segment base is simply bitwise-ORed with the zero-padded offset of the virtual address to form the
physical address. The specified segment limit register is also accessed and manipulated in conjunction with the offset to determine if the
virtual address is valid. More information on protection is given in the next section.
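A minimal sketch of local translation follows, under stated assumptions: the index is taken from bits 5-7 and the offset from bits 8-31 (as the bit positions in Figure 27 suggest), the limit register is assumed to hold (segment size - 1) as the stipulations at the end of the chapter state, and the class and exception names are hypothetical.

```python
from dataclasses import dataclass

OFFSET_BITS = 24                       # va[8:31], assumed from Figure 27
OFFSET_MASK = (1 << OFFSET_BITS) - 1


@dataclass
class LocalSegment:
    base: int           # physical base, aligned to the segment size
    limit: int          # assumed to hold (segment size - 1)
    valid: bool = True  # the V bit described in the protection section


class UnmappedAccess(Exception):
    """Stands in for the unmapped access exception."""


def translate_local(va: int, segments: list) -> int:
    index = (va >> OFFSET_BITS) & 0x7   # 3-bit index selects a register set
    offset = va & OFFSET_MASK           # zero-padded offset
    seg = segments[index]
    # Any offset bit set above the limit mask means the segment size is exceeded.
    if not seg.valid or (offset & ~seg.limit & OFFSET_MASK):
        raise UnmappedAccess(hex(va))
    return seg.base | offset            # bitwise OR works because base is aligned
```

Because the base is aligned to a multiple of the segment size, the OR never collides with offset bits, so no adder is needed.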


Figure 27: Local Address Translation (virtual address (va) fields: bits 0-4, bits 5-7, and bits 8-31)


Figure 28 shows the steps involved in global address translation, which is a reverse address translation style. In this case, the address is
checked to see if it is mapped locally by simply ensuring that the address is within the range specified by a valid set of the global segment
base address and limit registers. The hardware does not protect against overlapping global segments, i.e., system software must set up the
global segment registers appropriately so that any global virtual address is contained in at most one global segment. The multiple sets of
global segment registers are checked concurrently to see if any one of them should be used for the translation, similar to a fully associative
cache. If there is no match, a translation exception occurs. More detail on this matching and protection checking is given in the next section.
If there is a match, the virtual address is simply translated into a physical address by a bitwise-OR of an offset with the global segment
physical base register of the matching global segment. The offset is formed by using the limit register of the matching segment to mask off
the appropriate part of the virtual address.
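The matching step can be sketched as follows. This is an illustrative model only: it assumes the limit register holds (segment size - 1), loops over the register sets where the hardware compares them concurrently, and uses hypothetical class, field, and exception names.

```python
from dataclasses import dataclass

ADDRESS_MASK = 0xFFFFFFFF


@dataclass
class GlobalSegment:
    virt_base: int      # global virtual base register
    phys_base: int      # global physical base register
    limit: int          # assumed to hold (segment size - 1)
    valid: bool = True


class TranslationException(Exception):
    """Stands in for the exception raised when no segment matches."""


def translate_global(va: int, segments: list) -> int:
    for seg in segments:                # hardware checks all sets concurrently
        upper = va & ~seg.limit & ADDRESS_MASK
        if seg.valid and upper == seg.virt_base:
            offset = va & seg.limit     # limit masks off the base part of va
            return seg.phys_base | offset
    raise TranslationException(hex(va))
```

Returning on the first match is safe only because system software is required to keep the global segments non-overlapping.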


Figure 28: Global Address Translation (virtual address (va) fields: bits 0-4 and bits 5-31)


Memory Access Protection


In addition to the translation of virtual addresses to physical addresses, the address translation unit provides access protection and bounds
checking to ensure that the offset portion of an address is not outside the range of the segment. The 2 PR bits of a segment limit register
specify the access protection mode for that segment. Table 11 shows the possible access modes and their corresponding encodings.


TABLE 11. Segment Access Modes and Corresponding PR Bit Encodings

Encoding of PR Bits    Supervisor Privilege    User Privilege
00                     RW (read-write)         RW
01                     RW                      RO (read only)
10                     RW                      none
11                     RO                      none

Each local segment limit register consists of a limit value, a valid bit, and the two PR bits. The first level of protection for local addresses is
provided by ensuring that a valid set of segment registers is used. If the V bit of the selected local segment is not asserted, an unmapped
access exception occurs (refer to Chapter 8). The second level of protection is provided by the PR bits. If the PIM processor mode
(supervisor or user) and access type (read or write) are not allowed by the PR bit setting of the selected segment, an invalid access exception
occurs (refer to Chapter 8). The final level of protection for local addresses is provided with bounds checking. The limit value of the specified
segment is used to inspect bits in the virtual address offset to ensure that the offset has not exceeded the segment size. If the segment size is
exceeded, an unmapped access exception occurs (refer to Chapter 8). Assuming the limit value has been set according to the Implications
section at the end of this chapter, an equation specifying the exception condition E is:

\[ E = (va_{0} \wedge limit[index]_{0}) \vee (va_{1} \wedge limit[index]_{1}) \vee \ldots \vee (va_{23} \wedge limit[index]_{23}) \]

Although the conditions for address translation exceptions for global virtual addresses are similar to those of local addresses, the mechanism
is quite different due to the fully associative nature of the global segment hardware. Basically, if one of the four sets of global segment
registers does not "match" an attempted global address access, an exception occurs. A successful match occurs when a set of segment registers
is valid, the PR bit setting allows the access type being attempted, and the address range specified by the global virtual base and limit
encompasses the global address of the operation. An equation specifying the range match condition RM, where va is the virtual address and
base is the contents of the global virtual base register, is:

\[ RM = (limit_{0} \wedge (va_{0} \oplus base_{0})) \vee (limit_{1} \wedge (va_{1} \oplus base_{1})) \vee \ldots \vee (limit_{23} \wedge (va_{23} \oplus base_{23})) \]

An unmapped access exception is triggered if there is no valid set of registers that pass the range match test. If there is a valid set of registers
that passes the range match test, but the PR bits for that segment do not allow the attempted access, an invalid access exception occurs (refer
to Chapter 8).
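The PR-bit check can be illustrated with a small lookup. This sketch assumes the rows of Table 11 carry the encodings 00, 01, 10, and 11 in order; the dictionary and function names are hypothetical and do not come from the DIVA specification.

```python
# Access rights per PR encoding: (supervisor rights, user rights).
PR_MODES = {
    0b00: ("RW", "RW"),
    0b01: ("RW", "RO"),
    0b10: ("RW", "none"),
    0b11: ("RO", "none"),
}


def access_allowed(pr_bits: int, supervisor: bool, write: bool) -> bool:
    """Return True if the access is permitted; False models an invalid access exception."""
    supervisor_rights, user_rights = PR_MODES[pr_bits]
    rights = supervisor_rights if supervisor else user_rights
    if rights == "RW":
        return True
    if rights == "RO":
        return not write     # reads only
    return False             # "none": no access at all
```

For example, `access_allowed(0b01, supervisor=False, write=True)` is False, because a user may only read a segment encoded 01.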

The primary instruction supported by the DIVA instruction set which affects address translation operation is the MTATR (move to address
translation register) instruction. The destination field of this instruction can be set to specify any local base register, local protection register,
global virtual base register, global limit register, or global physical base register. Since the contents of a GPR are the data source for an
MTATR instruction, each of these address translation unit registers is defined to be 32 bits wide, although implementations may truncate
some segment registers to optimize for the actual amount of physical memory present. Furthermore, each limit register is a concatenation of
a limit value, a valid bit, and the two PR bits. The MTPR instruction is also used to enable/disable address translation by writing to the
appropriate bit of the PSW register.

There are a number of stipulations implied for the address translation mechanisms described in this chapter to operate correctly. First, every
segment size must be a power of 2, and the base address for each segment must be aligned to a value that is a multiple of the segment size.
Also, the limit value must be set so that simple logic functions can be used for translation and protection checking. For example, a segment
size of 2^n should have a limit register value that is (2^n - 1). Finally, the virtual to physical translation for code segments must not affect the
12 least significant bits so that instruction cache look-ups can proceed concurrently with translation. While stipulating that code segment
base addresses must be some multiple of 4 Kbytes is sufficient, it is not necessary, and less strict policies can be used to ensure the
requirement is met.
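System software could check these stipulations before loading a segment register. The helpers below are a hedged sketch: the function names are hypothetical, and the 4 Kbyte test implements only the sufficient condition mentioned in the text, not the weakest possible one.

```python
MIN_SEGMENT_SIZE = 256   # smallest segment size given in this chapter


def limit_value(size: int) -> int:
    """Limit register value for a power-of-2 segment size: (2^n - 1)."""
    if size < MIN_SEGMENT_SIZE or size & (size - 1):
        raise ValueError("segment size must be a power of 2 of at least 256 bytes")
    return size - 1


def base_is_aligned(base: int, size: int) -> bool:
    """The base must be a multiple of the segment size."""
    return base % size == 0


def code_base_is_safe(base: int) -> bool:
    """Sufficient (not necessary) condition: code base is a multiple of 4 Kbytes."""
    return (base & 0xFFF) == 0
```

With an aligned base and a limit of (2^n - 1), translation reduces to an OR and a mask, which is the "simple logic" the stipulations aim for.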

The exception portion of the architecture assumes that instruction and data address translations are independent. Thus, the PSW contains two
address translation enable bits (one for instruction addresses and one for data addresses). Likewise, the exception source word contains
separate status bits for instruction and data translation exceptions (refer to Chapter 8). There are also implications for better performance. For
example, to allow address translation for both instruction fetches and data fetches to proceed concurrently, the address translation hardware
must be dual-ported.

Appendix C: HiDISC Final Report


HiDISC: A Decoupled Architecture for Applications
in Data Intensive Computing

Abstract


The ever growing speed gap between processor and main memory has been a major
performance bottleneck of modern computer systems. As a result, today's data intensive
applications suffer from frequent cache misses and lose many CPU cycles due to pipeline
stalling. Although traditional prefetching methods reduce cache misses considerably,
most of them strongly depend on the access pattern being predicted and fail when faced
with irregular memory access patterns with low locality.

This report presents our design and performance evaluation of a novel, high-
performance decoupled architecture called HiDISC (Hierarchical Decoupled Instruction
Stream Computer). HiDISC provides low memory access latency by introducing
enhanced data prefetching techniques at both hardware and software levels. Three
dedicated processors, one for each level of the memory hierarchy, act in concert to mask
the memory latency.

As required by the DARPA Data Intensive program, we used as our performance
evaluation benchmarks the Data-intensive Systems Benchmark Suite and the DIS
Stressmark suite. The simulation results for both benchmarks show a distinct advantage
of the HiDISC system over current prevailing superscalar architectures.

1. Introduction


The speed mismatch between processor and main memory has been a major performance
bottleneck in modern processor architectures. Processor speed has been improving at a
rate of 60% per year during the last decade. Conversely, access latency to main memory
has been improving at less than 10% per year [24]. This speed mismatch - the Memory
Wall problem - results in considerable cost in terms of cache misses and severely
degrades processor performance. The problem becomes even more acute when faced
with highly data intensive applications. Indeed, these applications are becoming more
prevalent. By definition, they have a higher memory access/computation ratio than
"conventional" applications. Moreover, the access pattern tends to be more irregular. As
a result, the penalty caused by cache misses is becoming even more serious. This means
that the architect must either reduce pipeline stalling upon cache misses or reduce the
number of those cache misses (incidentally, this latter objective is the main goal of the
HiDISC project).


Figure 1: The speed mismatch between CPU cycle and DRAM speed

Reaching higher Instruction-Level Parallelism (ILP) through multiple instruction
issue and out-of-order execution has been an essential part of modern processor design
for many years. Moreover, sophisticated branch prediction and speculative execution
techniques provide more opportunities for the discovery of independent instructions
across basic blocks [31]. Various approaches using Thread-Level Parallelism (TLP) have
also been introduced to deliver more ILP. During the last decade, superscalar and very
long instruction word (VLIW) architectures have played an important role in ILP
research. Although both models are designed to deliver higher levels of parallelism
through multiple instruction issue, the ever increasing memory access latency has become
a major obstacle to the exploitation of higher degrees of ILP.

To solve the memory wall problem, current high performance processors are
designed with large amounts of integrated on-chip cache. However, this large cache
strategy works efficiently only for applications which exhibit sufficient temporal or
spatial locality. Newer applications such as multi-media processing, database, embedded
processor, automatic target recognition, and any other data intensive programs exhibit
irregular memory access patterns [15] and result in considerable numbers of cache misses
which cause significant performance degradation.

To reduce the occurrence of cache misses, various prefetching methods have been
developed. Prefetching is a mechanism by which data is fetched from memory to cache
before it is even requested by the CPU. It can be implemented either in hardware or in
software. Hardware prefetching [6] dynamically adapts to the runtime memory access
behavior and decides the next cache block to prefetch. Software prefetching [20] usually
inserts the prefetching instructions inside the code. Although previous prefetching
research considerably contributed to improvements in cache performance, prefetching
techniques still suffer from irregular memory access patterns. Indeed, typical prefetching
strategies strongly depend on the predictability of the future data addresses. This is very
difficult to predict when the access patterns are random [10]. Moreover, many current
applications use sophisticated data structures with pointers which dramatically lower the
regularity of memory accesses.

The Data Intensive Systems Benchmark Suite and the DIS Stressmark Suite are
used in this project as our performance evaluation benchmarks. Both benchmarks are
provided by Atlantic Aerospace Electronics Corporation [38][39] and supported by the
Data Intensive Systems project of the DARPA Information Technology Office.
Stressmark includes seven small data intensive benchmarks. Conversely, the DIS
benchmarks consist of five codes more realistic than Stressmark. The five benchmarks
can be categorized into three groups:

1.  The  Model  based  image  generation  group  has  two  benchmarks  -  Method  of 
Moments  and  Simulated  SAR  Ray  Tracing. 

2.  The  Target  detection  includes  Image  Understanding  and  Multidimensional 
Fourier  Transform. 

3.  The  Data  Management  benchmark 

2. Method, Assumptions, and Procedures

In order to counter the inherently low locality in Data Intensive applications, our design
philosophy is to emphasize the importance of memory-related circuitry and even employ
two dedicated processors to respectively manage the memory hierarchy and prefetch the
data stream.

2.1 The HiDISC System

Access/Execute decoupled architectures have been developed as alternate processor
architectures which exploit the parallelism between data access operations and "normal"
computation. Concurrency is achieved by separating the original, single instruction
stream into two streams based on the functionality of instructions. Asynchronous
operation of the streams provides for a certain distance between the streams and makes
data prefetching possible. The HiDISC architecture is an enhanced variation of
conventional decoupled architectures.

Decoupled architectures (also called Access/Execute architectures) deliver higher
degrees of Instruction-Level Parallelism by separating the sequential code into two
instruction streams - Access Stream and Execute Stream - based on memory access
functionality. Each stream runs almost independently of the other. The model was
originally developed to tolerate long memory latencies: hopefully, the Access Stream will
run ahead of the Execute Stream in an asynchronous manner, thereby allowing timely
prefetching. It should be noted at this point that an extremely important parameter will
be the "distance" between the instruction currently producing a data element in the
Access Stream and the instruction which uses it in the Execute Stream. This is also
called the slip distance, and it will be shown how it is a measure of tolerance to high
memory latencies. Communication is achieved via a set of FIFO queues (they are
architectural queues between the two processors to guarantee the correctness of program
flow).
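The role of the FIFO queue and the slip distance can be illustrated with a toy model. This is a minimal sketch under loud assumptions, not the HiDISC simulator: the step-based timing, the function name `run_decoupled`, the `ap_rate` parameter, and the doubling "computation" are all hypothetical, and the slip distance is approximated as the number of elements produced by the Access Stream but not yet consumed by the Execute Stream.

```python
from collections import deque


def run_decoupled(values, queue_depth, ap_rate=2):
    """Toy model: the Access Stream fills a bounded FIFO, the Execute Stream drains it."""
    load_queue = deque()
    produced = consumed = 0
    max_slip = 0
    results = []
    while consumed < len(values):
        # Access Stream: run ahead as long as the queue has room.
        for _ in range(ap_rate):
            if produced < len(values) and len(load_queue) < queue_depth:
                load_queue.append(values[produced])
                produced += 1
        max_slip = max(max_slip, produced - consumed)
        # Execute Stream: consume one operand per step, in FIFO order.
        if load_queue:
            results.append(load_queue.popleft() * 2)
            consumed += 1
    return results, max_slip
```

In this model the slip can never exceed the queue depth, which hints at why queue capacity bounds how much memory latency the Access Stream can hide.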

Our HiDISC (Hierarchical Decoupled Instruction Stream Computer) architecture
is a variation of the traditional decoupled architecture model. In addition to the two
processors of the original design, the HiDISC comprises one more processor for data
prefetching [6][8] (Figure 2). A dedicated processor for each level of the memory
hierarchy supplies the necessary data in a timely manner to the processor above it. Thus,
three individual processors are combined in this high-performance decoupled architecture.
They are used respectively for computing, memory access, and cache management:

Figure 2: The HiDISC System

•  Computation  Processor  (CP):  executes  all  primary  computations  except  for 
memory  access  instructions. 

•  Access Processor (AP): performs basic memory access operations such as
loads and stores. It is responsible for passing data from the cache to the CP.

•  Cache Management Processor (CMP): keeps the cache supplied with data
which will soon be used by the AP and reduces the cache misses, which
would otherwise severely degrade the data preloading capability of the AP.


By allocating additional processors to each level of the memory hierarchy, the
overhead of generating addresses, accessing memory, and prefetching is removed from
the task of the CP; the processors are decoupled and work relatively independently of one
another.


Figure 3: Inside the HiDISC architecture (labels include Slip Control Queue, Store Data Queue, and L2 Cache and Higher Level)


Now, our compiler must appropriately form three streams from the original
program: the computing stream, the memory access stream, and the cache management
stream are created by the HiDISC compiler and stored into the program memory of each
of the processors. As an example, Figure 4 shows the stream separation for the inner
loop of the discrete convolution algorithm.

The control flow instructions are executed by the AP. Incidentally, it should be
noted that additional instructions are required in order to facilitate the synchronization
between the processors. Also, the AP and the CP use specially designed tokens to ensure
correct control flow: for instance, when the AP terminates a loop operation, it simply
deposits the End-Of-Data (EOD) token into the load data queue. When the CP sees an
EOD token in the load data queue, it exits the loop.

Figure 4: Discrete Convolution as processed by the HiDISC Compiler


2.2  Experimental  Environment 

In  order  to  evaluate  the  performance  of  our  proposed  architecture,  we  have  designed  a 
simulator  for  our  HiDISC  architecture.  It  is  based  on  the  SimpleScalar  3.0  tool  set  [5] 
and  it  is  an  execution-based  simulator  which  describes  the  architecture  at  a  level  as  low 
as  the  pipeline  states  in  order  to  accurately  calculate  the  various  architectural  delays. 

Figure 5 shows a high-level block diagram of the simulation procedure. Each
benchmark program follows the two steps described. The first step consists of compiling
the target benchmark using the HiDISC compiler which we have designed, while the
second step is the simulation and performance evaluation phase.

Figure 5: Simulation Procedure


2.3 Operation of the HiDISC Compiler

The HiDISC executables are produced by our HiDISC compiler. The core operation of
the HiDISC compiler is stream separation. Stream separation is achieved by backward
chasing of load/store instructions based on the register dependencies. This means that, in
order to obtain the register dependencies between instructions, a Program Flow Graph
(PFG) must be derived. Indeed, the PFG generator and the stream separator are two
major operations of the HiDISC compiler. The PFG generator and the stream separator
are adopted after some modifications from the SimpleScalar 3.0 tool set and integrated in
the HiDISC compiler.

Figure 6 depicts the overall HiDISC compiler. Its detailed operation is described
below.

The input to the HiDISC compiler is a conventional sequential binary code. The
first step (1: Deriving the Program Flow Graph in Figure 6) consists of uncovering the
data dependencies between the instructions. Each instruction is analyzed so as to
determine which instructions are its parents. This determination is based on the source
register names. Whenever the stream separator meets any load/store instruction in step 2
(2: Defining Load/Store Instructions), it defines the instruction as the Access Stream
(AS) and chases backward to discover its parent instructions. The next step (3:
Instruction Chasing for Backward Slice) is designed to handle the backward chasing of
pointers. The instructions which are chased according to the data dependencies are called
the backward slice of the instruction from which we started.

Since the Access Stream should contain all access-related instructions, as well as
the address calculation and index generation instructions, the backward slice should be
included in the Access Stream as well. It should be noted that all the control-related
instructions are also part of the Access Stream. The instructions which should belong to
the control flow are determined by a similar method. After defining all the Access
Stream, the remaining instructions are, by default, classified as belonging to the
Computation Stream (CS).

In addition to the stream separation, appropriate communication instructions
should be placed in each stream in order to synchronize the two streams. Finding what
the required communications are is also based on the register dependencies between the
streams. Essentially, when it is determined that some required source data is produced by
the other stream, some kind of communication should take place. For instance, when a
memory load (inside the Access Stream) produces a result which should be used by the
Computation Stream, a Load instruction would be inserted in the Access Stream. It
would send the data to the Load Data Queue (LDQ). However, if the result of that load
was not needed by the Computation Stream, then obviously no such insertion would be
needed. Similarly, when the result produced by the Computation Stream is used by a
store instruction (inside the Access Stream), it should be sent to the store data queue
(SDQ) by inserting an appropriate communication instruction.
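The rule for finding the required communications can be sketched as follows. This is an illustration only, not the HiDISC compiler's code: the tuple instruction format `(opcode, destination, sources)`, the opcode sets, and the function name are hypothetical, the code handles straight-line code only, and it reports just the two cases named in the text (load results needed by the Computation Stream, and Computation Stream results needed by a store).

```python
LOAD_OPS = {"l.d", "lw"}     # hypothetical load mnemonics
STORE_OPS = {"s.d", "sw"}    # hypothetical store mnemonics


def find_communications(program, access):
    """Return (producer index, consumer index, queue) for cross-stream dependencies.

    program: list of (opcode, dest_register_or_None, source_registers)
    access:  set of instruction indices assigned to the Access Stream
    """
    last_writer = {}
    comms = []
    for i, (op, dest, srcs) in enumerate(program):
        for reg in srcs:
            p = last_writer.get(reg)
            if p is None:
                continue                         # value defined outside this block
            if p in access and i not in access and program[p][0] in LOAD_OPS:
                comms.append((p, i, "LDQ"))      # load result needed by the CS
            elif p not in access and i in access and op in STORE_OPS:
                comms.append((p, i, "SDQ"))      # CS result needed by a store
        if dest is not None:
            last_writer[dest] = i
    return comms


example = [
    ("l.d", "f1", ("r3",)),
    ("l.d", "f2", ("r4",)),
    ("mul.d", "f3", ("f1", "f2")),
    ("s.d", None, ("f3", "r5")),
]
example_comms = find_communications(example, access={0, 1, 3})
```

Here both loads feed the multiply through the LDQ, and the multiply feeds the store through the SDQ; a load whose result stays inside the Access Stream produces no entry.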

The backward chasing starts whenever we encounter a new load/store instruction.
The backward chasing ends when the procedure meets any instruction which already has
been defined as the Access Stream. The parent instructions of any defined Access
Stream have already been chased.
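The backward chasing procedure can be sketched as below. This is a simplified model under stated assumptions rather than the HiDISC compiler itself: the `Instr` format is hypothetical, only straight-line code is handled (control-flow instructions are omitted), and the value register of a store is assumed not to be chased, since the text routes store data through the SDQ.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

MEMORY_OPS = {"l.d", "s.d"}   # hypothetical load/store mnemonics


@dataclass(frozen=True)
class Instr:
    op: str
    dest: Optional[str] = None
    srcs: Tuple[str, ...] = ()        # registers chased (address registers for loads/stores)
    store_data: Tuple[str, ...] = ()  # value registers of a store, not chased


def separate_streams(program):
    """Return (access stream indices, computation stream indices)."""
    # Step 1: derive the parents of each instruction from source register names.
    last_writer, parents = {}, []
    for i, ins in enumerate(program):
        parents.append([last_writer[r] for r in ins.srcs if r in last_writer])
        if ins.dest is not None:
            last_writer[ins.dest] = i
    # Steps 2 and 3: mark each load/store and chase its backward slice.
    access = set()
    for i, ins in enumerate(program):
        if ins.op not in MEMORY_OPS:
            continue
        stack = [i]
        while stack:
            j = stack.pop()
            if j in access:
                continue              # already defined as Access Stream: stop chasing
            access.add(j)
            stack.extend(parents[j])
    compute = [i for i in range(len(program)) if i not in access]
    return sorted(access), compute


example_program = [
    Instr("sll", "r2", ("r1",)),             # index scaling
    Instr("addu", "r3", ("r2", "r10")),      # address calculation
    Instr("l.d", "f1", ("r3",)),
    Instr("l.d", "f2", ("r3",)),
    Instr("mul.d", "f3", ("f1", "f2")),
    Instr("s.d", None, ("r4",), store_data=("f3",)),
]
access_stream, computation_stream = separate_streams(example_program)
```

In the example only the multiply remains in the Computation Stream; the second load stops chasing at the `addu`, which the first load has already placed in the Access Stream.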

After separating the Access Stream and the Computation Stream, the CMP stream
is constructed by modifying the Access Stream. The instruction stream for the CMP is
indeed quite similar to the Access Stream. Only the load instructions are replaced with
the prefetch instructions for the CMP stream.
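As a sketch of this last step, the CMP stream can be derived by rewriting loads into prefetches. The tuple instruction format, the load mnemonics, and the `pref` opcode below are hypothetical placeholders; the text does not name the actual prefetch instruction.

```python
LOAD_OPS = {"l.d", "lw"}   # hypothetical load mnemonics


def build_cmp_stream(access_stream):
    """Copy the Access Stream, turning each load into a prefetch of the same address."""
    cmp_stream = []
    for op, *operands in access_stream:
        if op in LOAD_OPS:
            # Keep only the address operand: a prefetch writes no register.
            cmp_stream.append(("pref", operands[-1]))
        else:
            cmp_stream.append((op, *operands))
    return cmp_stream


example_access_stream = [
    ("addu", "r3", "r2", "r10"),
    ("l.d", "f1", "0(r3)"),
    ("s.d", "f4", "0(r5)"),
]
example_cmp_stream = build_cmp_stream(example_access_stream)
```

Address-generation instructions are kept unchanged, so the CMP computes the same addresses as the AP and can run ahead of it to warm the cache.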

Figure 7 shows an example of the operation of the backward slicing mechanism in
the HiDISC compiler. The assembly code input to the HiDISC compiler is the PISA
(Portable Instruction Set Architecture) which is the instruction set of the SimpleScalar
simulator [5]. We have selected for this example the inner product of Livermore loop
(1111). The PISA code is compiled into SimpleScalar binary by first using a version of
which targets SimpleScalar.

Initially, each memory access instruction is defined as belonging to the Access
Stream. Every parent instruction of a memory access instruction should then be identified.
In the example, the addu instruction in the fourth line (pointed to by an arrow (2)) - due to
the register Sd - and the mul instruction in the second line (pointed to by an arrow (3)) -
due to the register $23 - are also chased and marked as belonging to the AS. Likewise, other
instructions are examined based on the above approach. The instructions in the shaded
box in Figure 7 belong to the Access Stream.


Figure 7: Backward chasing of load/store instructions


After defining each stream, the communication instructions should be inserted.
The red lines in Figure 7 (forward arrows, solid lines) show the necessary
communications from the AS to CS. For example, the mul.d instruction (which is
marked as being inside the Computation Stream, pointed to by arrow 0) in the seventh
line requires data from the other instruction stream (the Access Stream). Therefore, both
l.d instructions in the fifth and sixth line need to send data to LDQ. Likewise, the purple
line at the bottom (forward arrow, dotted line) also shows the communication from the
CS to the AS via the Store Data Queue (SDQ).

Figure 8 shows the complete separation of the two streams and insertion of the
communication instructions.

Figure 8: Separation of sequential code


2.4 Benchmark Description

Applications causing large amounts of data traffic are often referred to as data-intensive
applications as opposed to computation intensive applications. Inherently, data-intensive
applications use the majority of the resources (time and hardware) to transport data
between the CPU and the main memory. The tendency for a higher number of
applications to become data intensive has become quite pronounced in a variety of
environments [39]. Indeed, many applications such as Automatic Target Recognition
(ATR) and database management show non-contiguous memory access patterns and
currently result in idle processors due to data starvation. These applications are more
stream-based and result in more cache misses due to lack of locality.

Frequent use of memory dereferencing and pointer chasing also creates an
enhanced pressure on the memory system. Pointer-based linked data structures such as
lists and trees are used in many current applications. For one thing, the increasing
popularity of Object Oriented Programming correspondingly increases the underlying use
of pointers. Due to the serial nature of pointer processing, memory accesses become a
severe performance bottleneck of existing computer systems. Flexible, dynamic
construction allows linked structures to grow large and difficult to cache. At the same
time, linked data structures are traversed in a way that prevents individual accesses from
being overlapped since they are strictly dependent upon one another [26].

The applications for which our HiDISC is designed are obviously data intensive
programs, the performance of which is strongly affected by the memory latency. As
required by the Data Intensive Systems project of the DARPA Information Technology
Office, we used for our benchmarks the Data-intensive Systems Benchmark Suite [39]
and DIS Stressmark Suite [38] provided by the Atlantic Aerospace Electronics
Corporation. Both of the benchmarks target data intensive applications. The DIS
benchmarks are five benchmark codes, which are more realistic and larger than
Stressmark. Stressmark includes seven small data intensive benchmarks, which extract
and show the kernel operation of data intensive programs.

Due to problems with the input data file, the Image Understanding benchmark
cannot be executed. Also, since the Corner-Turn benchmark among seven Stressmarks is
not provided with the source code, we only simulated the other six Stressmarks.

Table 1 shows the characteristics of each of the benchmarks simulated.

3. Results and Discussion

We used our architectural simulator of the HiDISC machine to evaluate the performance
of all the benchmarks except two.


3.1  Simulation  Parameters 

In our benchmark simulations, we assumed the architectural parameters outlined in Table
2. The baseline architecture for the comparison is a 4-way superscalar architecture,
which is implemented as sim-outorder in the SimpleScalar 3.0 tool set. In both cases, the
memory access latency was varied between 20 and 120 CPU cycles. The
baseline superscalar architecture supports out-of-order issue with 16 register update units
and 8 load/store queues.

3.2 Benchmark Results

Figure 9 and Figure 10 show the simulation results of the DIS Benchmark Suite and the
Stressmark Suite. The performance results of the HiDISC architecture are compared to a
4-way superscalar architecture. The far left bar indicates the performance results of the
superscalar architecture. The second bar shows the performance results of the basic
HiDISC architecture. The remaining two bars show the possible performance results
when enhancing the prefetching capability of the CMP processor. The numbers in
parentheses give the cache miss reduction ratio. The enhancements will be explained
in more detail in the next section.

Figure 9: DIS benchmark performance results


All four DIS benchmarks show better performance than the baseline superscalar
architecture. However, with the Stressmarks, only two of the six cases show better
performance for the HiDISC. The remaining four benchmarks do not show any
performance advantage for the HiDISC architecture.

Figure 10: Stressmark performance results


3.3 Discussion

The simulation results show that the HiDISC system performs quite well in general with
the DIS benchmarks. This is because the DIS benchmarks contain many long-latency
floating-point operations, which can effectively hide long memory latencies. In other
words, the amount of computation code and that of memory access code are well
balanced in the DIS Benchmark Suite. Conversely, the Stressmark
computation code is much smaller than the memory access code. This is one of the
main reasons for the somewhat weaker performance results observed in the case of the
Stressmark Suite.

Four DIS Benchmarks Results (Figure 9)

The four DIS benchmarks outperform the baseline superscalar architecture, particularly at
higher memory latencies. In particular, the Method of Moments is quite robust
when faced with longer memory latencies. It contains enough computation code
to hide the longer access latency. Also, the dependencies between the Computation
Stream and the Access Stream are comparatively light and provide enough slip
distance to hide long memory latencies.

In the case of the Multidimensional Fast Fourier Transform, HiDISC also
outperforms the superscalar architecture. However, the results show weaker
performance for long memory latencies even with the HiDISC model. Indeed, the
synchronization between the AS and the CS limits the possible slip distance between the
two streams. This is due to the data dependencies between the two streams: frequent data
dependencies between the Access Stream and the Computation Stream cause loss of
decoupling events. Usually, it is the CS which has to wait for a data element to be
produced by the AS (although the converse is also sometimes true). When this happens,
the slip distance between the two processors is reduced significantly: one processor must
wait for the other, and any advantage is negated since there is no more parallelism
between the two processors.
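The effect of synchronization on slip distance can be illustrated with a small timing model. The sketch below is only an illustration under simplifying assumptions that are ours, not the simulator's: the Access Stream issues at most one load per cycle, memory is fully pipelined, and every `sync_every`-th load must wait for the Computation Stream's previous result. The function name `run_decoupled` and its parameters are hypothetical.

```python
def run_decoupled(n_items, mem_latency, compute_cost, sync_every=0):
    """Return the cycle at which the Computation Stream finishes n_items.

    Toy model (assumptions, not the HiDISC simulator):
      - the Access Stream issues one load per cycle and memory is pipelined;
      - each item costs `compute_cost` cycles on the Computation Stream;
      - if sync_every > 0, every sync_every-th load depends on the previous
        computation result, i.e. a loss-of-decoupling event.
    """
    done_prev = 0    # cycle at which the CS finished the previous item
    next_issue = 0   # earliest cycle at which the AS may issue the next load
    for i in range(n_items):
        issue = next_issue
        if sync_every and i > 0 and i % sync_every == 0:
            # The AS must wait for the CS: the slip distance collapses to zero.
            issue = max(issue, done_prev)
        arrive = issue + mem_latency       # data reaches the load queue
        start = max(arrive, done_prev)     # CS needs both data and a free unit
        done_prev = start + compute_cost
        next_issue = issue + 1
    return done_prev


if __name__ == "__main__":
    for k in (0, 10, 1):
        print(k, run_decoupled(100, 100, 4, sync_every=k))
```

With no synchronization the memory latency is paid once and then hidden behind computation; when every load depends on the previous result, the full latency is paid on each item, which is the behavior described above for tightly coupled streams.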

The Data Management and the Ray-Tracing benchmarks are not affected by
longer memory latencies in either case. It should be noted that the working set for the
Data Management benchmark fits quite well in the cache. As should be expected, a
program with a small working set is not a good candidate for a prefetching architecture
such as the HiDISC. Conversely, due to the prefetching of the CMP, FFT exhibits better
performance.

Six Stressmarks Results (Figure 10)

Generally, the Stressmark codes are too small and contain too many operations which are
concerned only with data access. Therefore, the amount of computation code available to
hide data access is not sufficient. The HiDISC produces weaker results in four of the six
Stressmarks (Update, Field, Matrix and Neighborhood). However, the
remaining two Stressmarks (Pointer and Transitive Closure) advantageously
exploit the characteristics of our architecture.

Besides the unbalanced computation and access code ratio, frequent loss of
decoupling is another main reason for the weak performance we observe in several
Stressmarks. Indeed, the four Stressmarks other than Pointer and Transitive Closure contain
too many data dependencies and frequent synchronizations between the two streams.

However, in the Pointer Stressmark case, pointer chasing can be executed far
ahead since it does not require the computation results from the CP. The Transitive
Closure benchmark also produces good results because not much in the AP depends on
the results of the CP. In both cases, the Access Stream can run far ahead of the
Computation Stream: a sufficient slip distance is guaranteed in both benchmarks.

The slip distance is inherent to the instruction mix of the application:
if the Access Stream does not depend much on results from the Computation Stream, the
Access Stream can run earlier and maintain a high slip distance. Pointer and Transitive
Closure exhibit good performance for this reason. In addition to the possible slip
distance between the two streams, the Stressmark results suggest that applications which
are ideal for the HiDISC would be well balanced in terms of the ratio of computation
operations to memory operations.

Finally, the working set for the Stressmarks is quite small and the baseline
superscalar architecture does not suffer from many cache misses. Three Stressmarks
(Update, Field and Neighborhood) cannot improve even with the prefetching of the CMP.

Although some of the benchmarks show weak performance, the fact that the
Pointer Stressmark and the Transitive Closure Stressmark perform better than the baseline
superscalar architecture is quite encouraging and suggests the type of candidate
applications for the HiDISC architecture.

4.  Conclusions 

Current high-level programming languages and all supporting compilers are based on an
underlying sequential programming behavior. This is confirmed at the lower level, where
the instruction sets of modern microprocessors are based on a sequential model. However,
in order to exploit some parallelism at the instruction level, manufacturers of current
high performance processors have considerably changed the internal structure of the
processor. Also, several features of dataflow models, such as register renaming and dynamic
scheduling, have found their way into modern processor architectures and compiler
technologies [17]. Decoupled architecture is one such technique which promises to
improve performance.

The effectiveness of the HiDISC decoupled architecture has been demonstrated
here with data intensive applications. It has been shown that the proposed
prefetching method provides better ILP compared to conventional superscalar
architectures. However, the possible loss of decoupling, which is inherited from the
sequential behavior of the programs, stalls the processors and lowers utilization in some
cases. The results also point to some future modifications of the current CMP for
effective prefetching.

Clearly,  the  HiDISC  architecture,  as  designed,  will  shine  when  executing  data 
intensive  applications  because  they  contain  enough  computation  to  hide  long  memory 
latencies.  In  addition  to  that,  the  slip  distance  is  another  important  factor  which 
determines  overall  performance.  Too  many  data  dependencies  of  the  access  processor  on 
the  computation  processor  prevent  a  sufficient  slip  distance  from  developing.  Therefore, 
stream-like applications are favored for the HiDISC system.

5. Recommendations


Based upon these performance results, we propose some improvements to the basic
HiDISC architecture in order to make it fit a wider variety of applications.

5.1 Future Enhancements to the HiDISC

Although the independent management of the memory hierarchy provides an opportunity
to implement novel prefetching techniques, the HiDISC architecture suffers from two
significant weaknesses. First, the frequent synchronizations between the AP and the CP
cause stalling of the processors and result in low utilization. Second, the CMP code is
essentially no different from the AP code. Therefore, all the load instructions are forced
to run on the CMP as prefetches. However, not every prefetch by the CMP is
necessary or helpful. The enhancements needed to address these two problems are
described next.

The frequent synchronizations cause loss of decoupling and prevent timely
prefetching. As a result, each processor of the HiDISC loses many CPU cycles waiting
until the necessary data arrives. To solve this problem, Simultaneous MultiThreading
(SMT) should be added to the HiDISC architecture. SMT will raise utilization by
running multiple threads simultaneously. In other words, in a multithreaded HiDISC
system, SMT would raise the utilization of the processors, while decoupling would
reduce the memory latency [22][23].

The second modification is related to the current CMP design. The main
motivation for the existence of the CMP processor is to reduce the cache miss rate of the
Access Processor by timely prefetching. Therefore, the CMP should run ahead of the AP,
just as the AP runs ahead of the CP. However, in the basic HiDISC design, the
instruction stream for the CMP is quite similar to the Access Stream, which is a
significant limitation as far as the effectiveness of the prefetching is concerned. Our
original design executes every load instruction on the CMP. However, if the cache line
already resides in the cache, those prefetches become redundant operations.

Only instructions that are likely to miss in the future can benefit from the prefetches by
the CMP. However, the current CMP is too heavy and performs too many
redundant operations. Hence, in order to prefetch more efficiently into the cache, we
must develop better methods so that we execute only probable miss instructions.

We define the Cache Miss Access Slice (CMAS), which is a part of the Access
Stream, consisting of the probable cache miss instruction and its parent instructions. The
probable cache miss instructions can be found using the cache access profile [27][28].
The CMAS is executed on the existing CMP in a multithreaded manner. Indeed, the CMP is
an auxiliary processor for speculative execution of probable cache miss instructions.
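A minimal sketch of how a CMAS could be extracted is given here, assuming a cache access profile that records misses and accesses per load instruction and a map from each instruction to its parent instructions. The data layout, the 0.5 miss-rate threshold, and the names `probable_miss_loads` and `cmas` are illustrative assumptions, not the method of [27][28].

```python
def probable_miss_loads(profile, threshold=0.5):
    """Select loads whose profiled miss rate reaches the threshold.

    `profile` maps a load's PC to a (misses, accesses) pair; both the
    layout and the default threshold are assumptions for illustration.
    """
    return {pc for pc, (misses, accesses) in profile.items()
            if accesses and misses / accesses >= threshold}


def cmas(parents, miss_loads):
    """Return the probable-miss loads plus all of their ancestors.

    `parents` maps a PC to the PCs of the instructions producing its
    source operands (hypothetical layout).
    """
    selected = set()
    work = list(miss_loads)
    while work:
        pc = work.pop()
        if pc in selected:
            continue
        selected.add(pc)
        work.extend(parents.get(pc, ()))
    return selected


if __name__ == "__main__":
    profile = {0x10: (90, 100), 0x20: (1, 100)}
    parents = {0x10: [0x08], 0x08: [0x04], 0x20: [0x1C]}
    print(sorted(hex(pc) for pc in cmas(parents, probable_miss_loads(profile))))
```

Only the load at 0x10 passes the threshold, so the slice contains it and its address-generating ancestors; the rarely missing load at 0x20 is left out of the CMP's work.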

5.2 Flexi-DISC

One of the most striking characteristics of the HiDISC architecture is its inherent
flexibility and how it yields highly efficient execution of a large variety of loop-based
programs with little or no temporal locality. This fundamental feature is further extended
in the proposed Flexi-DISC. This new architecture will be targeted to a wide variety of
more complex, numerical and non-numerical applications (such as Automatic Target
Recognition).

While the original HiDISC is centered around three processors with well-defined
roles, the Flexi-DISC maintains the three roles of the CP, the AP, and the CMP at the
kernel of its fundamental machine model but elevates them to a more sophisticated concept:
the two highest levels (Access and Cache Management) still handle the transfer of
data between the memory system and the Computation level, while the third level remains
in charge of the computation per se. This can be represented as the three concentric rings
in Figure 11: the Computation Kernel (CK), the Low-level Cache Access Ring (LCAR),
and the Memory Interface Ring (MIR).

Figure 11: The Three-Ring Flexi-DISC Architecture

The fundamental observation which leads to this partitioning comes from the fact
that the types of applications (Memory Intensive) we have been targeting are both varied
in nature and also inherently highly dynamic at execution time. This may mean that
memory access patterns could range from, say, single use of any data element (no
temporal locality), to multiple reuses (high temporal locality). Consequently, the
bandwidth and types of pipes to and from the memory system must adapt to the changes,
whether they be static or dynamic. We plan on centering the whole architecture around a
highly reconfigurable Computation Kernel.

The central Computation Kernel is based on an array of simple processors which
can be dynamically rearranged to meet the demands of the current application. It can
even be partitioned into sub-arrays which are allocated to different portions of the
application (or even to different applications as needed). Such a powerful computation
kernel requires an equally powerful "pipeline" to carry information to and from the
memory system. Further, the variety of target applications makes the memory accesses
unpredictable. This means that depending on the application (or even the phase of a
given computation), the amount of memory traffic may fluctuate, and the prefetching
mechanisms must be allowed to adapt to the situation at hand. This also means that
instead of allowing a single processor for the Cache Access role and another for the
Cache Management role, a pool of identical processing units must be made available to
the two roles combined. This sharing enables a highly efficient dynamic partitioning of
the resources and their run-time allocation to the two outer rings (the Low-level Cache
Access Ring, and the Memory Interface Ring).

The technology developed for the HiDISC compiler can be expanded to include
the rearrangeability of the machine, as well as the partitioning it will undergo in the
presence of multi-headed applications.


Figure 12: Multiple application sharing of the Flexi-DISC model

Appendix A: Compiler and Simulator Description

The compiler and the simulator are based on the SimpleScalar 3.0 tool set. The two tools
have been designed by modifying sim-outorder.c. The first tool is sim-pfg.c, which takes
care of the whole compiling procedure, and the other one is sim-dumas.c, which
is the HiDISC simulator. This appendix gives a detailed description of the tools.


A.1. Compiler Tool: sim-pfg.c

sim-pfg.c is the C source code for the HiDISC compiler. The main tasks of sim-pfg.c
are: 1. deriving the Program Flow Graph and 2. separating the streams. The input for
sim-pfg.c is a binary executable for SimpleScalar, while the output is a binary executable
for the HiDISC architecture with the separation information.


[Figure 13 diagram: Benchmarks, Sequential executable, HiDISC executable]

Figure 13: The HiDISC Compiler

Figure 13 shows the procedure inside the HiDISC compiler. The two boxes
perform the operations mentioned earlier.


Deriving Program Flow Graph (PFG)

The Program Flow Graph delivers the data dependency information between instructions.
The dataflow relationship between instructions must first be defined in order to get the
backward slice of a certain target instruction. After this procedure, each access related
instruction can point to its parent instructions based on the source register name. The
main procedure is named pfg_const( ). Its detailed mechanism is described in Figure 14.

Figure 14: Deriving PFG Graph


The data structure for each instruction has been defined as pfg_station. After the
instruction is decoded, a dedicated pfg_station is assigned. The first step consists
in accessing the register table based on the source register name (referred to as (1) in
Figure 14). The register table gives the pointer to the instruction which last updated the
source register (actually, the pointer to the pfg_station of that instruction, referred to as
(2) in Figure 14). Finally, the decoded instruction can hold the pointers to its parent
instructions, referred to as (3) in Figure 14.
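The three steps can be sketched as follows for a straight-line instruction sequence. This is an illustration only, not the code of sim-pfg.c: control flow is ignored, the `PfgStation` fields are our guess at a minimal record, and the tuple encoding of instructions is a hypothetical convenience.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(eq=False)
class PfgStation:
    """Per-instruction record (hypothetical fields, for illustration only)."""
    index: int
    opcode: str
    dest: Optional[str]
    srcs: Tuple[str, ...]
    parents: List["PfgStation"] = field(default_factory=list)


def build_pfg(instructions):
    """Link every instruction to the last writer of each source register.

    `instructions` is a list of (opcode, dest_register_or_None, source_registers).
    """
    reg_table = {}   # register name -> station of the instruction that last wrote it
    stations = []
    for i, (opcode, dest, srcs) in enumerate(instructions):
        station = PfgStation(i, opcode, dest, tuple(srcs))
        for reg in srcs:
            producer = reg_table.get(reg)              # steps (1) and (2)
            if producer is not None and producer not in station.parents:
                station.parents.append(producer)       # step (3)
        if dest is not None:
            reg_table[dest] = station                  # this instruction is now the last writer
        stations.append(station)
    return stations


if __name__ == "__main__":
    program = [("addi", "r1", ["r0"]),
               ("lw", "r2", ["r1"]),
               ("add", "r3", ["r2", "r1"])]
    for st in build_pfg(program):
        print(st.index, st.opcode, [p.index for p in st.parents])
```

The register table is updated only after the sources have been looked up, so an instruction that reads and writes the same register is linked to the earlier writer rather than to itself.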

This is how we uncover the parent instructions of a load/store instruction. Later,
we can proceed with a backward chasing procedure in order to extract the backward slice
based on the PFG information.

Separating Stream

The stream separation is based on the register dependencies. First, when the decoded
instruction is either a load or a store instruction, it is immediately assigned to the Access
Stream. After that, the backward chasing procedure is started (the procedure named
chasing_parents( ) is called). Essentially, it is a function call which is recursively applied
until it reaches an instruction which has already been determined to belong to the Access
Stream.

The PFG information from the previous step yields the pointers to the parent
instructions. Therefore, the chasing_parents( ) procedure basically returns all the
pointers to the parent instructions.
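The recursive chase can be sketched as below, assuming the parent links are already available as a map from instruction index to parent indices. The function `separate_streams` and the opcode names are hypothetical, and the sketch assumes the parent map of a store lists only its address-generating parents, which the text does not spell out.

```python
def separate_streams(opcodes, parents):
    """Label each instruction "AS" (Access Stream) or "CS" (Computation Stream).

    `opcodes` is a list of opcode names; `parents` maps an instruction
    index to the indices of its parent instructions (assumed layout).
    """
    access = set()

    def chase(i):
        # Stop as soon as an instruction is already known to be in the AS.
        if i in access:
            return
        access.add(i)
        for p in parents.get(i, ()):
            chase(p)

    for i, op in enumerate(opcodes):
        if op in ("load", "store"):
            chase(i)   # loads and stores go to the AS, together with their ancestors
    return ["AS" if i in access else "CS" for i in range(len(opcodes))]


if __name__ == "__main__":
    ops = ["addi", "load", "mul", "add", "store"]
    deps = {1: [0], 2: [1], 3: [2], 4: [0]}
    print(separate_streams(ops, deps))
```

Python's recursion limit makes this form suitable only for short slices; a worklist loop would be the safer choice for long dependence chains.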

After the instruction is detected as belonging to the Access Stream, the stream
separation information is updated inside the binary file. Since each instruction of the
SimpleScalar binary includes an additional annotation field, those extra bits can be used
to carry the separation information.

A.2. Simulator: sim-dumas.c

The HiDISC simulator has been designed by modifying the sim-outorder.c module of the
SimpleScalar 3.0 tool set [5]. The major modifications consist in: 1. implementing the
three processors of the HiDISC and 2. implementing the communication mechanisms
(queues) between those three processors. As in the original SimpleScalar simulator, the
HiDISC simulator is also an execution-driven, cycle-time simulator.

To implement the three processors of the HiDISC, we replicated the pipelined RISC
processor of the SimpleScalar tool set three times and tailored each copy so that it
corresponds to the architecture of the respective HiDISC processor.

After the decoding stage, each processor has a corresponding ready list, which is
the instruction stream for that processor. We implement three different sets of functional
units, one unique to each processor. Procedure ruu_issue( ) of sim-outorder.c has
been copied and changed to ruu_issue_cp( ), ruu_issue_ap( ), and ruu_issue_cmp( ).
Each function scans its own ready list and finds an available functional unit that is
assigned to the corresponding processor.

The need for communication can also be detected at the decoding stage. If an
instruction requires data from the other processor, it is blocked and waits
until the other processor sends the data. The queue implementation is
handled using the existing link operations of the SimpleScalar tool set. All the necessary
source data is linked after the ruu_dispatch( ) procedure. Therefore, the sending
processor can "wake up" the waiting processor, just like an RUU station in sim-outorder.c.

Communications  between  the  AP  and  the  CMP  are  achieved  through  the  data 
cache.  Therefore,  the  data  cache  is  designed  and  implemented  to  be  shared  and  accessed 
by both processors.

Appendix B: Raw Performance Data


This appendix contains all the simulation results. The column denoted as mem
corresponds to the various memory latencies. The column marked SS contains the
performance of the baseline superscalar architecture. The fourth column, denoted as
HiDISC, contains the performance results of the HiDISC architecture without the CMP
processor. The remaining two contain the performance results with the CMP enhanced
prefetching algorithms. The performance measures are all in IPC (instructions per
clock).

Abstract

This paper presents code transformations designed to take advantage of the potential 2 orders of magnitude
bandwidth increase available in a PIM-based architecture. Using an image processing application as a case study, we
demonstrate how code transformations can exploit: (1) fine-grain parallelism in the wide-word processing unit to
maximize the computation performed on each processor cycle; (2) data reuse in the large register file to avoid
unnecessary memory accesses that stall the processor; and, (3) page mode accesses in the memory array to
minimize the cost of each remaining memory access. While most of the transformations described here are well-known
compiler techniques, in PIM-based systems we require a new optimization strategy to meet a very different
optimization goal as compared to conventional approaches focused primarily on exploiting locality in a data cache. We
demonstrate the importance of each set of transformations through simulation results.

1.0  Introduction 

The increasing gap between processor and memory speeds is a well-known problem in computer architecture,
with peak processor performance increasing at a rate of 60% per year while memory access times improve at
merely 7%. Further, techniques designed to hide memory latency, such as multithreading and prefetching,
actually increase the memory bandwidth requirements [Burger96]. Recent VLSI technology trends offer a
promising solution to bridging the processor-memory gap: integrating processor logic and memory in a
processing-in-memory (PIM) chip. Because PIM internal processors can be directly connected to the memory banks, the
memory bandwidth is dramatically increased (up to 2 orders of magnitude, tens or even hundreds of gigabits
per second aggregate bandwidth on a chip). Latency to on-chip logic is also reduced, down to as little as
one-fourth that of a conventional memory system, because internal memory accesses avoid the delays associated
with communicating off-chip.
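To make the size of the gap concrete, the following sketch compounds the two growth rates quoted above. It assumes, as a simplification of ours, that both figures can be treated as annual speed improvements; the function name `gap_factor` is hypothetical.

```python
def gap_factor(years, cpu_rate=0.60, mem_rate=0.07):
    """Relative growth of the processor-memory gap after `years` years.

    Assumes both rates compound annually as speed improvements, which is
    a simplification of the figures quoted in the text.
    """
    return ((1.0 + cpu_rate) / (1.0 + mem_rate)) ** years


if __name__ == "__main__":
    for n in (1, 5, 10):
        print(n, round(gap_factor(n), 1))
```

Under this assumption the gap widens by roughly a factor of 1.5 each year, so it grows more than fifty-fold in a decade.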

An important class of applications well-suited to PIM-based systems arises in image processing and other
multimedia problems. Such applications are bandwidth limited because they perform repeated computations on
streams of data; sometimes the applications have little temporal reuse [Ranganathan99]. At the same time, the
applications often exhibit inherent spatial locality and both fine-grain and coarse-grain parallelism. These
properties map well to PIM-based architectures. PIMs exploit spatial locality and fine-grain parallelism by
accessing and operating upon multiple words of data at a time, and exploit coarse-grain parallelism by
spreading independent computations throughout the memory. Thus, there is a significant opportunity for compiler
technology (or clever programmers) to achieve very high performance on PIM-based systems.

In recent years, researchers have proposed many PIM-based architectures [Elliot99, Gokhale95, Kang99,
Oskin98, Patterson97, Saulsbury96, Sunaga96, Torrellas00], but little attention has been paid to developing
compiler technology for such systems. Nevertheless, new compiler technology is needed to exploit the very
different architectural features of a PIM system. On-chip memory latencies are very low. An access to the same row
in the memory array as the previous access (i.e., a page mode access) costs only a few cycles, and other
accesses (in random mode) are 3-4 times slower, but still quite fast. Because of this lower latency, many PIM
devices do not have conventional data caches, but instead rely on simple, and much more space- and
power-efficient, caching mechanisms within the memory arrays themselves (for example, to exploit page mode
accesses) [Elliot99, Oskin98, Saulsbury96, Zawodny98]. To exploit available on-chip bandwidth, many PIM
chips also have wide data paths from memory to processing logic, and processors that can operate on several
words of data in one processor cycle [Oskin98, Patterson97]. To increase on-chip bandwidth further, many PIM
chips are small-scale multiprocessors, with processing logic sprinkled throughout the memory
[Elliot99, Kang99, Oskin98].
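The difference between page mode and random mode accesses can be illustrated with a simple cost model. The cycle counts and the row size used here (2 cycles for a page mode access, 8 for a random mode access, 64 words per row) are placeholder assumptions chosen only to reflect "a few cycles" and a several-fold penalty; they are not measured DIVA parameters, and `access_cycles` is a hypothetical helper.

```python
def access_cycles(addresses, row_words=64, page_cost=2, random_cost=8):
    """Total cycles for a sequence of word addresses.

    An access to the same memory row as the previous access is charged
    `page_cost`; any other access is charged `random_cost`. All three
    parameters are illustrative assumptions.
    """
    cycles = 0
    open_row = None
    for addr in addresses:
        row = addr // row_words
        cycles += page_cost if row == open_row else random_cost
        open_row = row
    return cycles


if __name__ == "__main__":
    sequential = list(range(256))
    # Same 256 words, visited so that consecutive accesses fall in different rows.
    strided = [(i % 4) * 64 + i // 4 for i in range(256)]
    print(access_cycles(sequential), access_cycles(strided))
```

Both traversals touch the same 256 words, yet the row-by-row order pays the random mode cost only four times, while the strided order pays it on every access. Reordering accesses to stay within an open row is therefore worthwhile even without a data cache.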

In this paper, we examine code transformations to exploit potential bandwidth of a particular PIM-based
system called DIVA (Data-Intensive Architecture), which has all of the three architectural features described in
the previous paragraph (i.e., caching only within the memory array, wide datapaths, multiprocessor-on-a-chip)
[Hall99]. Using an image processing application as a case study, we describe how the code could be effectively
transformed to tailor it to the DIVA architecture. These transformations accomplish several goals, exploiting:
(1) fine-grain parallelism in the wide-word processing unit to maximize the computation performed on each
processor cycle; (2) data reuse in the large, wide register file to avoid unnecessary memory accesses that stall
the processor; and, (3) page mode accesses in the memory array to minimize the cost of each remaining
memory access. We discuss the techniques that must be supported by a compiler to perform these transformations.
While most of the transformations described here are well-known compiler techniques, in DIVA we require a
new optimization strategy to meet a very different optimization goal as compared to conventional approaches
focused primarily on exploiting locality in a data cache.

In this paper, we assume that global data and computation partitioning has been performed across the system
[Anderson93], and we concentrate on how to optimize code for a single PIM processor. We describe each
transformation used, and illustrate how it impacts the performance of our application. We present simulation
results demonstrating the contribution of the transformations. Overall, we find that these transformations
reduce the number of memory accesses by almost a factor of 350, with most of the remaining memory accesses
in page mode. We also see a factor of 13 reduction in dynamic instructions executed. For this paper, we coded
the application in the DIVA ISA and performed the transformations by hand. We are using this and case studies
of other applications to guide the development of a compiler transformation algorithm to automatically
perform these transformations, and we are using the significant analysis and code transformation infrastructure in
the Stanford SUIF compiler as a basis for our implementation.

The remainder of the paper is organized into three sections and a conclusion. The next section presents an
overview of the DIVA architecture. Section 3 describes the image processing code we use as a case study and the
compiler transformations applied to it. Section 4 presents simulation results.

2.0 Overview of DIVA System Architecture


Figure 1: DIVA System Organization


In Figure 1, we show a small set of PIMs connected to a single external host through a host-memory interface;
through this interface the host processor performs standard reads and writes, augmented as discussed in Section
2.2. The PIM chips communicate through separate PIM-to-PIM channels so that the additional memory traffic
used to spawn computation, gather results, synchronize activity, or simply access non-local data bypasses the
system bus. The separate interconnect is provided because PIM-to-PIM communication requires greater
bandwidth than can be achieved with a conventional memory bus. Because this paper focuses on the activity within
a PIM node, we omit further description of the PIM-to-PIM interconnect, which is described in [Hall99].

2.1 PIM VLSI Component

A PIM is a VLSI memory device augmented with general and special-purpose computing hardware. A PIM
may consist of multiple nodes, each of which comprises a few megabytes of memory and a node processor.
The inset in Figure 1 shows a PIM with four nodes. The nodes on a chip share resources for communication
with the rest of the system. As a result, each chip contains a single PIM-to-PIM interface and a single host
interface. We anticipate that DIVA PIMs, like many other PIM chips, will be split roughly 60% memory and
40% logic (reflecting the importance of memory density).

Within a single node, shown in Figure 2, the processing logic consists of a standard scalar microprocessor
including a floating-point unit and a special DIVA wide-word functional unit that performs operations on
256-bit aggregate objects stored within a row of the local memory array. The wide-word unit can be used to perform
bit-level operations such as simple pattern matching, or higher-order computations such as searches and
associative and commutative reduction operations. The wide-word unit has a large register file, with 32 256-bit
registers. Details on a related wide-word unit are discussed elsewhere [Brockman99].

During execution, data is transferred directly from the memory array into the register files; there is no on-chip
data cache. Instead, we use the sense amps in the memory array as a small data cache, holding the full 2k-bit
row selected by the previous memory access. If two consecutive accesses are within the same memory row,
the second access is referred to as a page mode access. A page mode access is much faster than an access to an
arbitrary memory row (a random mode access), because it does not pay the penalties for clearing the sense
amps and loading a new row. In DIVA, we assume there is roughly a factor of 3 difference in latency between page
mode and random mode accesses.
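
The effect of this latency gap on a whole access stream can be estimated with a simple weighted average. The
following C sketch is illustrative only: the function name and its parameters are hypothetical, and the cycle
counts passed to it are whatever the reader assumes for page mode and random mode.

```c
#include <assert.h>

/* Average access latency, in cycles, for a stream in which a fraction
 * `page_fraction` of the accesses hit the open memory row (page mode)
 * and the remainder pay the random-mode latency.
 * Hypothetical helper, not part of the DIVA tools. */
double average_latency(double page_fraction, double page_cycles,
                       double random_cycles)
{
    assert(page_fraction >= 0.0 && page_fraction <= 1.0);
    return page_fraction * page_cycles
         + (1.0 - page_fraction) * random_cycles;
}
```

For example, assuming 2 cycles for page mode and 6 cycles for random mode (a factor of 3), a stream with 75% of
its accesses in page mode averages 3 cycles per access, half the cost of a stream that never reuses the open row.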


Figure 2: DIVA Node Organization


The instructions supported by the wide-word unit resemble those of multimedia ISA extensions such as MMX
and AltiVec, but in the case of DIVA, because the data comes directly from memory at low latency, we can
expect much better performance for applications that do not make effective use of cache. The architecture also
supports direct transfers of data between register files, rather than going through memory as in AltiVec.

2.2 Host-Memory Interface

An underlying goal is that DIVA PIM devices can also serve as conventional memory, so that they could be
used as smart-memory coprocessors in a standard system. This goal motivated a design of the PIM VLSI
device that includes a host interface consistent with the standard memory interface typical of commercial
memories. In keeping with this goal, we would also like to package the PIMs in DIMM modules with provisions for
top-plane interconnections between the memory chips to support the PIM-to-PIM communication fabric.
However, unlike commercial memories, computation activities give rise to new problems: how to communicate
internal exceptions and possible memory busy conditions to the host system. These issues are being addressed
as part of the larger system architecture.

2.3  DIVA  Memory  Model 

The DIVA memory model supports a globally addressable, distributed address space across the system.
Coherence between the host and PIMs must be enforced in software. PIM nodes communicate using parcels, a
variant of active messages, where messages are directed to objects rather than nodes. Segment registers at each
PIM node support very fast on-chip address translation for local addresses; a home node provides translation if
the address is non-local. Further details on the DIVA memory model can be found elsewhere [Hall99].

3.0  Code  Transformations 

In this section, we describe code transformations for exploiting the enormous bandwidth available on
PIM-based systems such as DIVA. While most of these transformations are well-known compiler techniques,
PIM-based systems require a new optimization strategy because their optimization goals differ from those of
conventional approaches, which focus primarily on exploiting parallelism and locality in a data cache.

Our first optimization goal is to achieve high bandwidth utilization by exploiting fine-grain parallelism in the
wide-word processing unit. Techniques for exploiting fine-grain parallelism in the wide-word unit also apply to
multimedia extensions such as MMX and AltiVec, which have wide register files and allow multiple operations
in a single processor cycle. Once fine-grain parallelism is exposed, the other optimization goals are to exploit
data reuse in the wide-word registers to avoid unnecessary memory stalls, and to take advantage of page mode
accesses in the memory array to minimize the cost of the remaining memory accesses.

To accomplish these optimization goals, we rely on several analyses and code transformations, most of which
are well-known. Parallelization analysis, which includes data dependence analysis and array data-flow
analysis, identifies loops whose iterations can be executed safely in parallel. Reuse analysis identifies loop iterations
that access the same data (temporal reuse) or distinct data in the same cache line (spatial reuse). In addition to
these analyses, several code transformations are required; the safety and profitability of the transformations are
based on the above analyses. Loop interchange of two tightly nested loops switches the inner and outer loop, and
is used both: (1) to move a parallel loop to a particular position in the loop nest (innermost for fine-grain
parallelism or outermost for coarser granularity); and (2) to move reuse to an innermost position. Loop unrolling
creates multiple copies of a loop body and modifies the loop control accordingly. Statement reordering
reorganizes statement execution while preserving data dependences. Loop fusion involves combining two adjacent
loops with the same loop control into a single loop, and is used to promote reuse between the loops. Tiling reorders
the iterations in a loop nest to bring accesses to the same data closer together in the iteration space. Parallel
reductions are transformed versions of commutative and associative operations whose iterations can be
reordered; a particular implementation of a parallel reduction is to perform independent operations on private
copies of the variable and accumulate the partial results to the global copy of the variable. Register allocation for
array variables uses data dependence analysis and loop transformations to map array variables to registers. In
DIVA, we have also identified the need for a new transformation, shifting within a wide register between
operations to exploit spatial reuse.
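
As a minimal sketch of the parallel reduction described above, the following C function sums an array through
private partial sums that are combined at the end. The function name and the number of private copies (LANES)
are assumptions made for illustration; the reordering is legal only because integer addition is associative
and commutative.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4  /* number of private copies; an arbitrary illustrative choice */

/* Sum `n` values using LANES private partial sums, each of which could be
 * updated independently, then accumulate the partial results into one total. */
long reduce_sum_private(const int *data, size_t n)
{
    long partial[LANES] = {0};
    long total = 0;
    size_t i;

    assert(data != NULL || n == 0);

    for (i = 0; i < n; i++)
        partial[i % LANES] += data[i];   /* lane (i % LANES) owns element i */

    for (i = 0; i < LANES; i++)
        total += partial[i];             /* combine the private copies */

    return total;
}
```

Each private copy is touched by only one lane, so the first loop carries no dependence between lanes; only the
short combining loop is sequential.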

We are using the Stanford SUIF compiler as a basis for our implementation. With the exception of statement
reordering, array register allocation, and the shift operation, implementations of the required analyses and code
transformations are already present in SUIF [Wolf92][Hall95]. These transformations must be performed in
conjunction with global data and computation partitioning, which exploits coarse-grain parallelism by
partitioning the computation across the PIMs [Anderson93], for which SUIF also has an implementation. Our
current research involves developing a decision algorithm for using these analyses and transformations, reworking
the existing decision algorithm to focus on the unique optimization goals in the DIVA architecture.

We demonstrate how to employ these transformations for PIM-based architectures with a case study from a
template-matching code called SLD from Sandia National Laboratories. This code performs a correlation
between an image and a series of templates to match the templates to windows within the image; it sums the
image gray-scale values for pixels that are nonzero in the template. A pictorial description and a simplified, but
representative, loop nest are shown in Figure 3. In the figure, we show the image on the left, and the templates on
the right. The two innermost loops perform a correlation between a single template and a particular window
within the image; the two outer loops move the window of interest within the image.


[Illustration: image (left) and template (right)]

for (irow = 0; irow < 32; irow++)                      /* traverse image rows */
  for (icol = 0; icol < 32; icol++)                    /* traverse image columns */
    for (trow = 0; trow < 32; trow++)                  /* traverse template rows */
      for (tcol = 0; tcol < 32; tcol++)                /* traverse template columns */
        if (template[trow][tcol] != 0)
          corr[irow][icol] += image[irow+trow][icol+tcol];

Figure 3: Original loop nest from template-matching code
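
For readers who want to experiment with the transformations, the loop nest of Figure 3 can be written as a
self-contained C function. This is a sketch under stated assumptions: the function name, the parameter name
templ (used because template is reserved in C++), the 32-bit accumulator type, and the 64 x 64 image of 8-bit
pixels (the smallest size that keeps every index in bounds) are ours, not taken from the SLD code.

```c
#include <assert.h>
#include <string.h>

#define WIN 32          /* template side and number of window positions per axis */
#define IMG (2 * WIN)   /* assumed image side: 64 pixels of 8 bits each */

/* Scalar reference version of the Figure 3 loop nest: for every window
 * position, sum the image pixels that lie under nonzero template entries. */
void correlate_scalar(unsigned char image[IMG][IMG],
                      unsigned char templ[WIN][WIN],
                      unsigned int corr[WIN][WIN])
{
    int irow, icol, trow, tcol;

    assert(image != NULL && templ != NULL && corr != NULL);
    memset(corr, 0, sizeof(corr[0]) * WIN);   /* clear all WIN rows of corr */

    for (irow = 0; irow < WIN; irow++)                 /* image rows */
        for (icol = 0; icol < WIN; icol++)             /* image columns */
            for (trow = 0; trow < WIN; trow++)         /* template rows */
                for (tcol = 0; tcol < WIN; tcol++)     /* template columns */
                    if (templ[trow][tcol] != 0)
                        corr[irow][icol] += image[irow + trow][icol + tcol];
}
```

This version performs one conditional and one byte load per template element, which is the behavior the
following steps set out to improve.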

The rest of this section shows how this loop nest is transformed to better exploit features of the DIVA system.
We assume that this work is done in conjunction with global data and computation partitioning, which exploits
coarse-grain parallelism by partitioning the computation across the PIMs [Anderson93]. The SUIF compiler is
able to partition the computation by giving different templates to different PIMs, with no inter-PIM
communication.

3.1 Step 1: Fine-Grain Parallelism

The most important opportunity for high bandwidth utilization is to take advantage of the fine-grain
parallelism in the wide-word instructions. Figure 4 shows how this is achieved in our example loop nest. Because each
pixel is represented by an 8-bit object, 32 pixels can be processed in a single processor cycle. The innermost
loop is transformed to exploit the wide operations, consisting of loads, alignment, a pairwise logical and, and a
macro that implements a reduction sum. The reduction sum is a 4-stage reduction operation that combines
portions of the correlation register until the sum of all elements is computed.


for (irow = 0; irow < 32; irow++)                      /* traverse image rows */
  for (icol = 0; icol < 32; icol++)                    /* traverse image columns */
    for (trow = 0; trow < 32; trow++) {                /* traverse template rows */
      /* loop tcol becomes sequence of wide operations */
      wld   WR11, &(image[irow+trow][icol]);           /* load lower half of image row */
      wld   WR12, &(image[irow+trow][icol+32]);        /* load higher half of image row */
      align (WR11, WR12, icol);                        /* align to wide register WR11 */
      wld   WR15, &(template[trow][0]);                /* load template row */
      wmul  WR1, WR11, WR15;                           /* select pixels according to template */
      corr[irow][icol] += reduction_sum (WR1);         /* add up selected pixels */
    }

Figure 4: Loop nest after transformations for fine-grain parallelism.


We use data dependence analysis and reduction recognition to identify this fine-grain parallelism. Further, we
must employ reuse analysis to identify parallel loops for which there is also spatial reuse. The innermost loop
must perform the same operation on adjacent (or nearby, with uniform stride) elements to be able to exploit the
wide-word instructions. In some cases, loop transformations such as loop interchange are needed to move the
desired parallel loop, with spatial reuse, to the innermost position.
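
The wide operations in Figure 4 can be modeled in plain C to check their effect on small inputs. The sketch
below is a software stand-in, not the DIVA ISA: the type wide_t, the function names, and the log-step folding
schedule in the reduction are assumptions made for illustration, and the folding is not meant to reproduce the
hardware reduction macro.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define WIDE_BYTES 32   /* a 256-bit wide word holds 32 8-bit pixels */

/* Software stand-in for a 256-bit wide register (hypothetical type). */
typedef struct { uint8_t b[WIDE_BYTES]; } wide_t;

/* Models a wide load: fetch 32 consecutive pixels starting at src. */
wide_t wide_load(const uint8_t *src)
{
    wide_t w;
    assert(src != NULL);
    memcpy(w.b, src, WIDE_BYTES);
    return w;
}

/* Keep each pixel whose template byte is nonzero and clear the others;
 * this models the "select pixels according to template" step. */
wide_t wide_select(wide_t pixels, wide_t templ)
{
    wide_t out;
    int i;
    for (i = 0; i < WIDE_BYTES; i++)
        out.b[i] = templ.b[i] ? pixels.b[i] : 0;
    return out;
}

/* Sum all lanes by repeatedly folding the upper half onto the lower half. */
unsigned int wide_reduction_sum(wide_t w)
{
    unsigned int lane[WIDE_BYTES];
    int i, width;
    for (i = 0; i < WIDE_BYTES; i++)
        lane[i] = w.b[i];                      /* widen to avoid 8-bit overflow */
    for (width = WIDE_BYTES / 2; width >= 1; width /= 2)
        for (i = 0; i < width; i++)
            lane[i] += lane[i + width];
    return lane[0];
}
```

One call to wide_select followed by wide_reduction_sum replaces the 32 iterations of the scalar tcol loop for a
single template row.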

3.2 Step 2: Spatial Reuse in Large Register File

Once parallelism is exploited, our next priority is to eliminate as many accesses to memory as possible, to
avoid losing bandwidth to memory stall cycles. Since we do not have a cache, we must reuse data within
registers as much as possible. We exploit both spatial and temporal reuse in the wide registers. Spatial reuse is
possible because a load into a wide register fetches several consecutive words in one transfer; for example, in
the innermost loop from Figure 3, spatial reuse occurs because several arrays are accessed by columns
(assuming row-major ordering).

In the icol loop, there is both spatial and temporal reuse. Each iteration of icol operates on a subrow of the
image consisting of elements (icol, ..., icol + 31), and therefore consecutive iterations of icol, say i and
i + 1, operate on elements (i, ..., i + 31) and (i + 1, ..., i + 32). As illustrated in Figure 4, each iteration of icol
accesses a new image pixel that is consecutive in memory to the last pixel accessed in the previous iteration.
Also, pixels (i + 1, ..., i + 31), accessed in iteration i, are reused in iteration i + 1. To exploit the reuse in this
loop, the loop nest is transformed by interchanging loops icol and trow, making loop icol the
innermost loop. On each iteration i of loop icol, the subrow (i, ..., i + 31) is brought to a wide register by
shifting the data in wide registers WR11 and WR12 so that pixel i - 1 is shifted out and pixel i + 32 - 1 is shifted
into the last byte of WR11. Since the same wide words are loaded from memory on all iterations of loop icol,
the memory accesses can be moved outside the loop. While this shift operation is an unusual compiler
technique specifically for operations on wide data types, detecting its applicability is straightforward, involving
checking the dependence distance on the loop for small, constant distances. Also, the number of accesses to the
correlation matrix is reduced by exploiting the spatial reuse of corr[irow][icol] in loop icol. A
correlation row is loaded into a wide register and on each iteration of loop icol, the correlation value
corr[icol] is updated and stored back in the wide register. Figure 5 shows the resulting loop nest after these
transformations.


for (irow = 0; irow < 32; irow++)                      /* traverse image rows */
  for (trow = 0; trow < 32; trow++) {                  /* traverse template rows */
    wld  WR11, &(image[irow+trow][0]);                 /* load lower half of image row */
    wld  WR12, &(image[irow+trow][32]);                /* load higher half of image row */
    wld  WR15, &(template[trow][0]);                   /* load template row */
    wld  WR20, &(corr[irow][0]);                       /* load correlation row */
    for (icol = 0; icol < 32; icol++) {                /* traverse image columns */
      wmul WR1, WR11, WR15;                            /* select pixels according to template */
      WR20[icol] += reduction_sum (WR1);               /* add up selected pixels */
      shift_right (WR11, WR11, WR12);                  /* shift image row by one pixel */
    }
    wst  WR20, &(corr[irow][0]);                       /* store correlation row */
  }

Figure 5: Loop nest after spatial reuse transformations
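
The structure of Figure 5 can be emulated in portable C to confirm that hoisting the loads and shifting the
window gives the same correlation values as the original loop nest. This is a sketch only: byte arrays stand in
for the wide registers, all names are hypothetical, the image is assumed to be 64 x 64 8-bit pixels, and the
helper shifts both halves of the register pair, which is an assumption needed to keep the software model
self-consistent.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define WIN 32          /* template side and wide-word width in pixels */
#define IMG (2 * WIN)   /* assumed image side (64 x 64 8-bit pixels) */

/* Advance the 64-pixel pair (lo, hi) by one pixel: lo[0] is dropped,
 * hi[0] moves into the last byte of lo, and hi is zero-filled.
 * Stands in for shift_right(WR11, WR11, WR12). */
static void shift_pair(uint8_t lo[WIN], uint8_t hi[WIN])
{
    memmove(lo, lo + 1, WIN - 1);
    lo[WIN - 1] = hi[0];
    memmove(hi, hi + 1, WIN - 1);
    hi[WIN - 1] = 0;
}

/* Loads are hoisted out of the icol loop, the window is advanced by
 * shifting, and the correlation row stays in a local buffer until stored. */
void correlate_shift(uint8_t image[IMG][IMG], uint8_t templ[WIN][WIN],
                     uint32_t corr[WIN][WIN])
{
    int irow, trow, icol, k;

    assert(image != NULL && templ != NULL && corr != NULL);
    memset(corr, 0, sizeof(corr[0]) * WIN);

    for (irow = 0; irow < WIN; irow++) {
        for (trow = 0; trow < WIN; trow++) {
            uint8_t lo[WIN], hi[WIN];
            uint32_t row[WIN];

            memcpy(lo, &image[irow + trow][0], WIN);     /* wld WR11 */
            memcpy(hi, &image[irow + trow][WIN], WIN);   /* wld WR12 */
            memcpy(row, corr[irow], sizeof row);         /* wld WR20 */

            for (icol = 0; icol < WIN; icol++) {
                uint32_t sum = 0;
                for (k = 0; k < WIN; k++)                /* wmul + reduction_sum */
                    if (templ[trow][k] != 0)
                        sum += lo[k];
                row[icol] += sum;
                shift_pair(lo, hi);                      /* shift_right */
            }
            memcpy(corr[irow], row, sizeof row);         /* wst WR20 */
        }
    }
}
```

In this model each (irow, trow) pair touches memory for two image loads, one template load, and one load and one
store of the correlation row, instead of once per pixel.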

Exploiting temporal reuse in the wide registers is also important, not only for reducing the number of memory
accesses, but also for reducing potential intervening accesses that would displace the open memory row and
result in subsequent larger random-mode latencies. Temporal reuse in the wide register file is exploited in Step
4 below.

3.3 Step 3: Maximizing Page Mode Memory Accesses

The transformations for exploiting reuse result in a significant reduction in the number of memory accesses.
However, since there are accesses to three distinct arrays (image, template and correlation) in the body of loop
trow, there are intervening accesses between loads of consecutive image rows and consecutive template rows.
These intervening accesses displace the current open memory row from the sense amps between reuses of the
same memory row, and as a result, most of the memory accesses that reuse a memory row still suffer from
higher random-mode latencies.

To exploit the lower page-mode latencies, we must reorder the loads. In the example of Figure 5, this can be
accomplished by unrolling loop trow and grouping the memory accesses. We then fuse together the unrolled
loop bodies. Figure 6 shows a simplified version of the resulting code, in which loop trow is unrolled by a
factor of 2, for illustration purposes only. In practice, the unrolling factor depends on the size of the wide register
file. In our experiments, we actually unroll the trow loop by 3, resulting in 4 copies of the loop body, so that
the result will fit in the 32 wide registers.

Identifying potential page-mode accesses, that is, memory-row reuse, is equivalent to identifying spatial reuse
at a wide-word granularity. In other words, the locality analysis that identifies spatial reuse in caches can be
extended to identify loop iterations that access distinct wide words in the same memory row. Once any
memory-row reuse is identified, loop transformations may be applied to reduce the reuse distance. When the
memory-row reuse occurs in consecutive iterations of a loop, but there are still intervening accesses on each
iteration, the reuse can be exploited by unrolling the loop and grouping together the memory accesses with
memory-row reuse, as in the example in Figure 6.
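
The benefit of grouping accesses can be illustrated with a small open-row model. The C sketch below counts how
many accesses in an address stream fall in the same memory row as the access immediately before them. The
function name is hypothetical, and the 256-byte row size is simply the 2k-bit row mentioned earlier expressed
in bytes; any array base addresses fed to it are assumptions chosen by the reader.

```c
#include <assert.h>
#include <stddef.h>

#define ROW_BYTES 256   /* a 2k-bit memory row, as held in the sense amps */

/* Count the accesses that fall in the same memory row as the access just
 * before them (page-mode accesses) in a stream of byte addresses. */
size_t count_page_mode(const unsigned long *addr, size_t n)
{
    size_t i, hits = 0;

    assert(addr != NULL || n == 0);
    for (i = 1; i < n; i++)
        if (addr[i] / ROW_BYTES == addr[i - 1] / ROW_BYTES)
            hits++;
    return hits;
}
```

As a usage example, suppose the image, template, and correlation arrays start at the hypothetical byte
addresses 0, 0x10000, and 0x20000. Two iterations of the un-grouped order (image low half, image high half,
template row, correlation row) give the stream 0, 32, 0x10000, 0x20000, 64, 96, 0x10020, 0x20000, in which
only 2 of the 8 accesses are in page mode. The grouped order 0, 32, 64, 96, 0x10000, 0x10020, 0x20000 reaches
4 page-mode accesses out of 7.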

for (irow = 0; irow < 32; irow++) {                    /* traverse image rows */
  /* loop trow unrolled by 2 */
  for (trow = 0; trow < 32; trow += 2) {               /* traverse template rows */
    /* load 2 image rows in page mode */
    wld  WR11, &(image[irow+trow][0]);                 /* load lower half of image row */
    wld  WR12, &(image[irow+trow][32]);                /* load higher half of image row */
    wld  WR13, &(image[irow+trow+1][0]);               /* load lower half of next image row */
    wld  WR14, &(image[irow+trow+1][32]);              /* load higher half of next image row */
    /* load two template rows in page mode */
    wld  WR15, &(template[trow][0]);                   /* load template row */
    wld  WR16, &(template[trow+1][0]);                 /* load next template row */
    wld  WR20, &(corr[irow][0]);                       /* load correlation row */
    for (icol = 0; icol < 32; icol++) {                /* traverse image columns */
      wmul WR1, WR11, WR15;                            /* select pixels according to template */
      WR20[icol] += reduction_sum (WR1);               /* add up selected pixels */
      shift_right (WR11, WR11, WR12);                  /* shift image row by one pixel */
      wmul WR1, WR13, WR16;                            /* select pixels according to template */
      WR20[icol] += reduction_sum (WR1);               /* add up selected pixels */
      shift_right (WR13, WR13, WR14);                  /* shift image row by one pixel */
    }
    wst  WR20, &(corr[irow][0]);                       /* store correlation row */
  }
}

Figure 6: Loop nest after transformations to maximize page mode accesses

3.4 Step 4: Wide Register Allocation for Array Variables

Figure 7 shows a simplified version of the final code. In this version, we transform the code to exploit temporal
reuse in wide registers, to complement the spatial reuse in Step 2. Effectively, what we are doing is performing
code transformations to facilitate allocating array variables to wide registers, as is done in conventional
architectures [Carr94][Wolf92]. Previous approaches achieve this goal with some combination of tiling, unrolling
and fusion; they exploit only temporal reuse, as spatial reuse only comes into play when registers hold multiple
words of data. In DIVA, we must first exploit spatial reuse in the wide registers as in Steps 1 and 2, and then,
given the spatial reuse, also exploit temporal reuse. Here, we perform tiling to move the temporal reuse closer
together in the iteration space, so that the data can fit in the limited space of the wide register file. We unroll the
tiled loops so we can refer to the appropriate register explicitly in the code.

In the example, shown in Figure 7, we exploit the temporal reuse of template[trow] carried by loop
irow. Loops irow and trow are tiled, and the tile sizes are chosen so that a set of image rows plus a set of
template rows fits in the wide register file. Furthermore, since the set of template rows (2 rows shown in the
example code) is reused in the tiled loops, each template row needs to be loaded only once, before a new tile is
executed. In practice, the tile size depends on the dependences and the size of the wide register file. In our
experiments, we use a tile size of 2 for the irow loop and 4 for the trow loop.
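
The tiling itself can be checked independently of the wide-word hardware. The following C sketch tiles the
scalar loop nest with the tile sizes quoted above and copies each group of template rows once per tile, which
stands in for keeping them in wide registers. The function name, the array types, and the 64 x 64 image are
assumptions; because 32 is divisible by both tile sizes, the sketch omits the min() bound that a general tiled
loop needs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define WIN 32          /* template side and number of window positions per axis */
#define IMG (2 * WIN)   /* assumed image side (64 x 64 8-bit pixels) */
#define IROW_TILE 2     /* tile size for the irow loop */
#define TROW_TILE 4     /* tile size for the trow loop */

void correlate_tiled(uint8_t image[IMG][IMG], uint8_t templ[WIN][WIN],
                     uint32_t corr[WIN][WIN])
{
    int it, tt, irow, trow, icol, tcol;

    assert(image != NULL && templ != NULL && corr != NULL);
    memset(corr, 0, sizeof(corr[0]) * WIN);

    for (it = 0; it < WIN; it += IROW_TILE) {
        for (tt = 0; tt < WIN; tt += TROW_TILE) {
            /* Template rows tt .. tt+TROW_TILE-1 are fetched once per tile
             * and then reused for every image row in the tile. */
            uint8_t trows[TROW_TILE][WIN];
            memcpy(trows, &templ[tt][0], sizeof trows);

            for (irow = it; irow < it + IROW_TILE; irow++)
                for (trow = 0; trow < TROW_TILE; trow++)
                    for (icol = 0; icol < WIN; icol++)
                        for (tcol = 0; tcol < WIN; tcol++)
                            if (trows[trow][tcol] != 0)
                                corr[irow][icol] +=
                                    image[irow + tt + trow][icol + tcol];
        }
    }
}
```

With these tile sizes each template row is fetched once for every two image rows rather than once per image
row, while the final correlation values are unchanged because the accumulation is a sum.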


/* loop irow tiled by irow_tsz */
for (irow = 0; irow < 32; irow += irow_tsz) {          /* traverse image rows */
  /* loop trow unrolled by 2 */
  for (trow = 0; trow < 32; trow += 2) {               /* traverse template rows */
    /* load two template rows in page mode */
    /* template rows are reused in tiled loop irow' */
    wld  WR15, &(template[trow][0]);                   /* load template row */
    wld  WR16, &(template[trow+1][0]);                 /* load next template row */
    for (irow' = irow; irow' < min (irow+irow_tsz, 32); irow'++) {
      /* load 2 image rows in page mode */
      wld  WR11, &(image[irow'+trow][0]);              /* load lower half of image row */
      wld  WR12, &(image[irow'+trow][32]);             /* load higher half of image row */
      wld  WR13, &(image[irow'+trow+1][0]);            /* load lower half of next image row */
      wld  WR14, &(image[irow'+trow+1][32]);           /* load higher half of next image row */
      wld  WR20, &(corr[irow'][0]);                    /* load correlation row */
      for (icol = 0; icol < 32; icol++) {              /* traverse image columns */
        wmul WR1, WR11, WR15;                          /* select pixels according to template */
        WR20[icol] += reduction_sum (WR1);             /* add up selected pixels */
        shift_right (WR11, WR11, WR12);                /* shift image row by one pixel */
        wmul WR1, WR13, WR16;                          /* select pixels according to template */
        WR20[icol] += reduction_sum (WR1);             /* add up selected pixels */
        shift_right (WR13, WR13, WR14);                /* shift next image row by one pixel */
      }
      wst  WR20, &(corr[irow'][0]);                    /* store correlation row */
    }
  }
}

Figure 7: Final code, including transformations for temporal reuse in registers

4.0  Experiment 

We simulated four versions of the benchmark on our DIVA simulator, and present results on the improvements
due to the transformation steps from the previous section.

4.1 DSIM Simulation Environment

We have developed a system simulator called DSIM, which uses RSIM as a framework, with significant
extensions [RSIM]. RSIM models shared-memory multiprocessors built with state-of-the-art processors. The DSIM
host processor is taken directly from RSIM, as well as the host first and second-level caches. Our extensions
include a simpler PIM processor with a WideWord ALU, a new memory system, and a new PIM-to-PIM
interconnect network. We also developed application-level primitives for DIVA, such as a flush instruction, and a
barrier for PIMs and host.

Table 1 summarizes the host and PIM processor parameters used in our simulations. DSIM models a host
processor with out-of-order instruction execution, multiple-issue and non-blocking loads, with an architecture
based on the MIPS R10000. Both L1 and L2 caches are pipelined and support multiple outstanding requests to
separate cache lines. The host node is connected to the memory system via a split-transaction, Mbit-wide bus.

Each PIM node includes a PIM processor, a memory bank (which includes control and arbitration logic), and
an interface to the PIM-to-PIM interconnect. The PIM processor is much simpler (and smaller) than the host
processor. We extended the RSIM ISA with DIVA PIM wide instructions that operate on 256-bit wide words.
For these experiments, we make the conservative assumption that the PIM processor runs at half the speed of
the host system. Although the inherent speed of the logic is no slower, as we are assuming the DRAM is
embedded in a logic process [IBM99], we make this assumption because the wide-word register accesses could
impact the clock speed.

The memory system consists of the aggregation of all PIM memories, where each local memory is visible from
both host and local PIM processor. DSIM models each PIM memory in detail. It maintains the current open
row in the memory bank to determine the memory access time and simulates arbitration between host and PIM
accesses.


Host Processor and Memory Hierarchy
  Issue width                                4
  Integer arithmetic units                   2
  Floating point units                       2
  Address generation units                   1
  L1 cache size                              32K bytes
  L1 cache hit time                          1 cycle
  L2 cache size                              1M bytes
  L2 cache hit time                          10 cycles
  L1, L2 cache associativity                 2
  Memory latency                             52 cycles (page mode)
                                             60 cycles (random mode)

PIM Processor
  Issue width                                1
  Integer arithmetic units                   1
  Floating point units                       1
  Wide word units                            1

PIM Memory
  Memory latency (in PIM processor cycles)   2 cycles (page mode)
                                             6 cycles (random mode)

Table 1: Simulation Parameters


4.2  Results 

In order to evaluate the benefits of the compiler transformations described in Section 3, we performed
experiments using four versions of the example loop nest. The first version, Scalar, corresponds to the original loop
nest of Figure 3. The second version, which we call Fine-Grain, corresponds to Figure 4, where fine-grain
parallelism is exploited using wide instructions. The third version, called Spatial Reuse, exploits spatial reuse in wide
registers as in Figure 5. The fourth version, called Max Page Mode + Temporal Reuse, combines the
transformations from Figure 6 and Figure 7 for maximizing page mode accesses and exploiting temporal reuse in the wide
register file.

All four versions of the loop nest were hand-coded in the DIVA PIM ISA. We originally ran experiments using
the original sequential loop nest from Figure 3, which was written in C and compiled with optimization level 4.
However, since the code generated by the compiler is not tuned to the DIVA architecture, a comparison
between the hand-coded transformed loops and the compiled scalar version unfairly skewed the benefits of our
approach. We thus hand-coded the sequential version in order to perform a fair comparison of all versions of
the loop nest. In our hand-coded version of the original loop nest, we unrolled the innermost loop by a factor of
4, so that a single 32-bit load brings 4 pixels to the register file on each loop iteration.

Table 2 shows the effect of the transformations on the number of dynamic instructions executed and on the
number and type of memory accesses.


code version                      # instrs      total reads   % page mode reads   total writes   % page mode writes
Scalar                            .505. 1)4 M   25.17 M       55.5.5%             25.50 M        0%
Fine-Grain                        5.5.47 M      5.15 M        55,5.5%             0.25 M         56.25%
Spatial Reuse                     27.57 M       0.25 M        57.12%              0.20 M         65.07%
Max Page Mode + Temporal Reuse    20.55 M       0.11 M        77.80%              0.05 M         74.00%

Table 2: Impact of Transformations on Memory Accesses


Using Scalar as a baseline, we see that roughly l.i% of the instructions are memory accesses. As compared to
the baseline, we see a factor of over 350 reduction in memory accesses in the final version, and a factor of over
13 reduction in dynamic instructions executed. This improvement is due to several factors. Fine-Grain shows a
factor of almost 15 reduction in memory accesses and a factor of over 11 reduction in dynamic instructions by
exploiting the available memory bandwidth using the wide ALU and wide data path to memory. Exploiting
spatial reuse in registers, particularly with the shifting operation, results in a factor of over 6.5 reduction in
memory accesses in the Spatial Reuse version, as compared to Fine-Grain, but just a modest reduction in dynamic
instructions executed. The number of writes actually increases because, after interchanging the loops, we
compute partial correlation sums that are written back to memory. The fourth version shows an additional factor of
3.1 reduction in memory accesses as a result of exploiting temporal reuse in registers, and a slight increase in
the number of dynamic instructions executed due to the tiling control loop. We also see that in the final version
over 74% of the remaining reads and writes are now in page mode (as compared to less than 17% in the original
version), which results in a lower average memory latency.

We also performed experiments comparing the execution of the entire application on the host processor against a version running on the host and multiple PIM processors. Figure 8 shows the speedups with respect to the original sequential program running on the host processor. The PIM versions were obtained by replicating the (4 Kbyte) image and assigning a subset of templates to each PIM node, with no PIM-to-PIM communication. Each PIM node computes the matches on its local templates. At the end of the PIM phase, the host collects the PIM results and computes the best match across all templates. The PIM code is hand-coded in the DIVA ISA, and it is based on the final version from Figure 7. The benefits of exploiting fine-grain parallelism and reducing memory costs result in a speedup of 2.8 for one PIM node. Combining these benefits with the coarse-grain parallelism exploited by distributing the computation among several PIM processors, we observe a speedup of 38.2 on 32 PIM nodes. Due to simulation time constraints, we used a data set size of 32 templates for the experiments shown in Figure 8. Therefore, each PIM node is assigned only one template in the 32-PIM version, and the cost of replicating the image (which is performed sequentially by the host) becomes a significant fraction of the total execution time. We expect the speedups to scale with the number of PIM nodes when using larger data set sizes, since there is no PIM-to-PIM communication.
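
The coarse-grain scheme described above (replicate the image, give each PIM node a subset of templates, let the host pick the overall winner) can be sketched as follows. This is a minimal illustration, not the DIVA code: the round-robin assignment, the function names, and the convention that a larger score means a better match are all assumptions made for the example.

```python
from typing import List, Optional, Sequence, Tuple


def partition_templates(num_templates: int, num_nodes: int) -> List[List[int]]:
    """Assign template indices to PIM nodes round-robin (assumed policy)."""
    return [list(range(node, num_templates, num_nodes)) for node in range(num_nodes)]


def host_reduce(node_results: Sequence[Optional[Tuple[float, int]]]) -> Tuple[float, int]:
    """Host-side step: pick the best (score, template_index) over all nodes.

    Assumes a larger correlation score is a better match. Nodes that were
    given no templates report None and are skipped.
    """
    return max(result for result in node_results if result is not None)
```

With 32 templates and 32 nodes, `partition_templates(32, 32)` gives every node exactly one template, which is the configuration in which the sequential image replication on the host weighs most heavily.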
Figure 8: Speedups over host-only execution with increasing numbers of PIM nodes

5.0 Summary and Future Work

This paper presented code transformations for taking advantage of the processor-memory bandwidth in DIVA and related PIM-based architectures: exploiting fine-grain parallelism in wide word instructions, reuse in the large register file, and page mode memory accesses. We showed these techniques are highly beneficial for one image processing code; using the original hand-coded version as a baseline, we see a factor of over 350 reduction in memory accesses. Another nice feature of this (and many other image processing and multi-media applications) is that it can also exploit coarse-grain parallelism with (little or) no communication; each node on each PIM can execute independently. As a result, the application yields scalable parallel performance as more PIM nodes are introduced, with a speedup of 38.2 over host execution on a DIVA system with 32 PIM nodes.

We are working to automate this approach in the Stanford SUIF compiler, which we are using as the basis of the DIVA compiler. Almost all of the transformations are already implemented in the SUIF system, as well as the tests for safety and an algorithm to guide the transformations based on parallelization and reuse analysis. To automate our approach, we need to implement a few additional transformations including statement reordering, shifting and allocation of array variables to wide registers. We must also develop a new algorithm for guiding the transformations based on the requirements of the DIVA architecture.
Abstract

The Data-Intensive Architecture (DIVA) system incorporates Processing-In-Memory (PIM) chips as smart-memory coprocessors to a microprocessor. This architecture exploits inherent memory bandwidth both on chip and across the system. Thus, the performance of pointer-based and sparse-matrix computations, as well as multimedia applications, is significantly enhanced.

A key feature of the DIVA architecture is the address translation mechanism, which supports virtual addressing of application code and data. Instead of prohibitive conventional page tables, DIVA provides a simplified mechanism using segments. In this paper, the design of the address translation unit is presented, and trade-offs in VLSI design including performance, area, and design modulation are also discussed.

1.  Introduction 

The Data-Intensive Architecture (DIVA) project is building a workstation-class system using embedded memory technology to replace the memory system of a conventional workstation with "smart memories" capable of very large amounts of processing. The goal of the project is to significantly reduce the ever-increasing processor-memory bandwidth bottleneck in conventional systems. System bandwidth limitations are thus overcome in three ways, as illustrated in Figure 1: (1) tight coupling of a single processing-in-memory (PIM) processor with an on-chip memory bank; (2) distributing multiple processor-memory nodes per PIM chip; and (3) utilizing a separate chip-to-chip interconnect, for direct communication between nodes on different chips that bypasses the host system bus.


Figure  1  DIVA  system  architecture 


This paper describes the design of an address translation unit, which is the key component that implements memory management in DIVA PIM chips. Previous literature [1] identified two aspects that distinguish DIVA's memory management requirements from those of other PIM-based architectures.

• The PIM serves as the only memory for a standard host microprocessor, assuming the dual role of "smart memories" and conventional memory.

• DIVA targets applications that are most severely impacted by the processor-memory bottlenecks in conventional systems: sparse-matrix and pointer-based applications with irregular memory access patterns, and image and video applications with large working sets.

As compared to system-on-chip solutions [2-3], and multiprocessors made up solely of PIM chips [4-5], DIVA's support for conventional memory access from an external host requires a dual view of memory, from the host's and the PIM's perspective. A much broader range of programming paradigms is provided when compared with other PIM architectures. As a result, DIVA requires an efficient address translation mechanism and independent threads of control as the features in its memory models. A previous paper presented an overview of the DIVA project and described a memory model [6] and memory management [1] to support these requirements. This paper is focused on the design of the address translation unit and its circuit implementation.

The remainder of the paper is organized as follows. Section 2 describes the mechanism of address translation in DIVA. Section 3 presents the detailed hardware design of the address translation unit (ATU). Section 4 presents a VLSI implementation and results, and Section 5 concludes the paper.

2. Address translation mechanism

The virtual address space of the host processor in the DIVA architecture can be categorized into three classifications:

• Global memory is composed of contiguous segments distributed across nodes, visible to applications running on the host and PIM nodes.

• Dumb memory is a region of a node's memory allocated as conventional pages in a host application's virtual space and untouched by PIM node processing.

• Local memory is a region of a node's memory used exclusively by node routines. An exception to this rule is made during initialisation, when the host system boot process loads node software.

A node must be able to rapidly determine if an address is located in its own memory, and if so, find the physical address. Segments are used to condense translation information. Each segment is defined by segment registers containing a base address and size.

The local memory region is partitioned into eight segments in the DIVA architecture. Like pages in a conventional system, the segment descriptors are generic in nature. It is only through system programming that the segments serve a specific purpose [1].

Remote addresses are translated via the concept of a home node, which is guaranteed to have the translation information. In addition to the local segments, a node maintains translation information for its resident portion of the global memory, as well as for any remote data for which it is the home node.

The primary functions of the node address translation unit are to translate virtual addresses to physical addresses for those accesses that are locally resident and to provide access protection. The types of accesses generated by a DIVA PIM processor that require translation include instruction fetches and data accesses to memory or memory-mapped devices such as parcel buffers, generated by load or store instructions.

Given the simplicity of the address translation scheme discussed above, very little hardware support is needed to effect translation. A segment base address register and limit register are needed for each of the eight local segments. Also, one virtual base, limit, and physical base register are needed for each resident global segment. The DIVA architecture provides four sets of global segment registers. The address translation unit contains no direct support for home node translation, although the preferred system programming is such that the global segments resident on a node form the portion of global memory for which that node is the home node. If this is not the case, address faults invoke system software that performs the home node translation.

3. Design of ATU

The DIVA PIM processor provides 4 Gbytes of virtual address space accessible to kernel and user applications via segments that are a power of 2 in size. Segment sizes can range from 256 bytes to the maximum amount of physical memory available to a node. The maximum segment size in the initial DIVA system design is 16 Mbytes. Each virtual address generated by the PIM processor is 32 bits, and the resulting physical address generated by the address translation unit is also 32 bits.

The PIM processor address translation unit supports three main types of address translation: direct address translation, local address translation, and global address translation.

Figure 2 shows the three main address translation mechanisms provided. When the address translation unit is disabled, direct address translation occurs, and the address translation unit will not generate any exceptions. In this case, the resulting physical address is identical to the virtual address. If address translation is enabled, then the scope field of the virtual address must be inspected to determine what type of translation should be used.

The scope field of the virtual address is the most significant five bits of the virtual address VA. If this 5-bit value is zero, then local translation is used. If the scope field equals binary value 00001, i.e., the virtual address falls in the range of 0x08000000 to 0x0FFFFFFF, direct translation is used to generate the physical address; however, unlike the mode where address translation is disabled, an exception can be generated in this case if access privileges are violated. By definition, the address region 0x08000000 to 0x0FFFFFFF is a supervisor level region. Therefore, any user-level attempt to access this region while address translation is enabled will trigger an exception. Lastly, if any of the four most significant bits of the virtual address are non-zero, i.e., va[0:3] != 0, then global translation is used.
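
The selection rule above can be written down compactly. The sketch below only models which translation type is chosen; the enum and function names are illustrative and not part of the DIVA design, and privilege checking for the supervisor region is left out.

```python
from enum import Enum


class Translation(Enum):
    DIRECT = "direct"
    LOCAL = "local"
    GLOBAL = "global"


def classify(va: int, atu_enabled: bool = True) -> Translation:
    """Pick the translation type for a 32-bit virtual address."""
    va &= 0xFFFFFFFF
    if not atu_enabled:
        return Translation.DIRECT      # ATU disabled: physical == virtual
    if va >> 28:                       # any of the four most significant bits set
        return Translation.GLOBAL
    scope = va >> 27                   # 5-bit scope field; only 0 or 1 remain here
    return Translation.DIRECT if scope == 1 else Translation.LOCAL
```

For example, `classify(0x08000000)` selects direct translation (the supervisor region), while `classify(0x10000000)` selects global translation.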
Figure 2. Address translation types

Figure 3 shows the steps involved in local address translation. The 3-bit index field of the virtual address is used to select a set of local segment registers for the translation. The segment base is simply bitwise-ORed with the zero-padded offset of the virtual address to form the physical address. The specified segment limit register is also accessed and manipulated in conjunction with the offset to determine if the virtual address is valid.
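
A minimal sketch of the local translation step follows. The field layout is an assumption: a 5-bit scope, a 3-bit index, and therefore a 24-bit offset, which agrees with the 16 Mbyte maximum segment size but is not spelled out in the text. The class and function names are hypothetical, and validity checking is omitted.

```python
from dataclasses import dataclass
from typing import Sequence

OFFSET_BITS = 24                       # assumed: 32 - 5 (scope) - 3 (index)
OFFSET_MASK = (1 << OFFSET_BITS) - 1


@dataclass(frozen=True)
class LocalSegment:
    base: int  # physical base; assumed aligned so the bits used by the offset are zero


def translate_local(va: int, segments: Sequence[LocalSegment]) -> int:
    """Form the physical address for a local access (no protection checks)."""
    index = (va >> OFFSET_BITS) & 0x7  # 3-bit index selects one of eight register sets
    offset = va & OFFSET_MASK          # zero-padded offset
    return segments[index].base | offset
```

Because segments are a power of 2 in size and the base is assumed to be aligned, the OR behaves like an addition but needs no carry chain.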

Figure 4 shows the steps involved in global address translation, which is a reverse address translation style. In this case, the address is checked to see if it is mapped locally by simply ensuring that the address is within the range specified by a valid set of the global segment base address and limit registers. The hardware does not protect against overlapping global segments. The multiple sets of global segment registers are checked concurrently to see if any one of them should be used for the translation, similar to a fully associative cache. If there is a match, the virtual address is simply translated into a physical address by a bitwise-OR of an offset with the global segment physical base register of the matching global segment. The offset is formed by using the limit register of the matching segment to mask off the appropriate part of the virtual address.
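
The global lookup can be sketched as below. The limit-register encoding is an assumption made for the example: 1 bits mark the address bits that must equal the virtual base, and the remaining bits form the offset. The loop stands in for the concurrent comparison done in hardware, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class GlobalSegment:
    valid: bool
    virt_base: int
    limit: int      # assumed encoding: 1s mark bits that must match virt_base
    phys_base: int


def translate_global(va: int, segments: Sequence[GlobalSegment]) -> Optional[int]:
    """Return the physical address, or None if no resident segment matches."""
    for seg in segments:  # sequential stand-in for the concurrent hardware check
        if seg.valid and ((va ^ seg.virt_base) & seg.limit) == 0:
            offset = va & ~seg.limit & 0xFFFFFFFF  # limit masks off the matched bits
            return seg.phys_base | offset
    return None
```

A `None` result corresponds to the case in which system software must take over, for instance to perform home node translation.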



Figure 4. Global address translation

In addition to the translation of virtual addresses to physical addresses, the address translation unit provides access protection and bounds checking to ensure that the offset portion of an address is not outside the range of the segment. The 2 PR bits of a segment limit register specify the access protection mode for that segment. Table 1 shows the possible access modes and their corresponding encodings.


Table 1. Segment access modes and corresponding PR bit encodings

Each local segment limit register consists of a limit value, a valid bit, and the two PR bits. The first level of protection for local addresses is provided by ensuring that a valid set of segment registers is used. If the V bit of the selected local segment is not asserted, an unmapped access exception occurs. The second level of protection is provided by the PR bits. If the PIM processor mode and access type are not allowed by the PR bit setting of the selected segment, an invalid access exception occurs. The final level of protection for local addresses is provided with bounds checking. The limit value of the specified segment is used to inspect bits in the virtual address offset to ensure that the offset has not exceeded the segment size. If the segment size is exceeded, an unmapped access exception occurs. Equation (1) specifies the exception condition E for local translations.

$$E = (va_{8} \wedge limit[index]_{8}) \vee (va_{9} \wedge limit[index]_{9}) \vee \cdots \vee (va_{23} \wedge limit[index]_{23}) \qquad (1)$$
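
The three protection levels for local addresses can be sketched in the order the text gives them. Two things are assumptions here: the limit value is taken to hold a 1 in every checked offset bit that must be zero for an in-range access, and the checked bits are taken to be offset bits 8 through 23 (256 bytes up to 16 Mbytes). The PR encodings of Table 1 are not reproduced in the text, so the permission decision is passed in as a boolean. All names are hypothetical.

```python
from typing import Optional

CHECKED_BITS = 0x00FFFF00  # offset bits 8..23 (assumed range of the bounds check)


def limit_for_size(size_bytes: int) -> int:
    """Assumed limit encoding for a power-of-2 segment size."""
    assert 256 <= size_bytes <= 1 << 24
    assert size_bytes & (size_bytes - 1) == 0
    return CHECKED_BITS & ~(size_bytes - 1)


def local_exception(va: int, valid: bool, access_allowed: bool, limit: int) -> Optional[str]:
    """Return the exception name for a local access, or None if it is legal."""
    if not valid:                      # level 1: V bit
        return "unmapped access"
    if not access_allowed:             # level 2: PR bits (decision supplied by caller)
        return "invalid access"
    if va & limit & CHECKED_BITS:      # level 3: OR-reduction of the masked offset bits
        return "unmapped access"
    return None
```

For a 256-byte segment, `limit_for_size(256)` sets every checked bit, so any offset of 0x100 or more raises the bounds exception.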


Although the conditions for address translation exceptions for global virtual addresses are similar to those for local addresses, the mechanism is quite different due to the fully associative nature of the global segment hardware. Basically, if none of the four sets of global segment registers matches an attempted global address access, an exception occurs. A successful match occurs when a set of segment registers is valid, the PR bit setting allows the access type being attempted, and the address range specified by the global virtual base and limit encompasses the global address of the operation. Equation (2) specifies the range match condition RM where va is the virtual address and base is the contents of the global virtual base register.

$$RM = (limit_{0} \wedge (va_{0} \oplus base_{0})) \vee (limit_{1} \wedge (va_{1} \oplus base_{1})) \vee \cdots \vee (limit_{23} \wedge (va_{23} \oplus base_{23})) \qquad (2)$$

An unmapped access exception is triggered if there is no valid set of registers that satisfies the range match test. If there is a valid set of registers that satisfies the range match test, but the PR bits for that segment do not allow the attempted access, an invalid access exception occurs.
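
The two global exception cases can be told apart as in the sketch below. Each register set is summarised by three booleans computed elsewhere; how overlapping segments with different permissions are resolved is not stated in the text, so accepting the access when any matching set permits it is an assumption, as are the names.

```python
from typing import Iterable, Optional, Tuple


def global_exception(register_sets: Iterable[Tuple[bool, bool, bool]]) -> Optional[str]:
    """Classify a global access.

    Each entry is (valid, range_match, access_allowed) for one register set.
    """
    hits = [allowed for valid, range_match, allowed in register_sets
            if valid and range_match]
    if not hits:
        return "unmapped access"   # no valid set satisfies the range match test
    if not any(hits):
        return "invalid access"    # matched, but the PR bits forbid the access
    return None
```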

4. Implementation and results

Based on the design presented in Section 3, the schematic for the ATU design is presented in Figure 5. Four major components were designed and implemented with Synopsys tools: virtual to physical translation (V2P) module, controller, special purpose register files, and probe circuit.


Figure 5. Schematic of ATU implementation

Design issues for these blocks are as follows:

• V2P module: This is the core circuit to implement the translation scheme specified in Section 3. The key design issue for this circuit is the translation speed. To achieve the defined specification of .*ins, several techniques in VHDL coding for synthesis were required. The result was a pure combinational logic circuit that is able to implement the translation mechanism with minimum overhead in speed and circuit area.

• Special purpose register file: This module is an interface for the PIM processor to set up the translation table. Since simultaneous translation and table set-up will never occur with proper system software, the core issue of this module's design is to reduce the circuitry area while providing a wide data bus with low propagation delay, which provides a complete static table look-up for the V2P module to speed up the translation.

• Controller: This module translates bus signals and memory access signals to two V2P modules, one for processor memory requests and the other for instruction cache memory requests.

• Probe Circuit: This module is used when a specific DIVA instruction is used to probe the address translation unit, allowing a user-level process to interrogate the status of a virtual address without incurring an exception if the address is not mapped.

Each module was designed and synthesized using Synopsys tools; the timing and circuit size of each module was optimised using the constraints mentioned above. The circuitry of the whole address translation unit was generated using Cadence Silicon Ensemble. The design is based on TSMC 0.18µm technology with Artisan standard cells. Table 2 summarizes the results. Varying the use of different optimisations, a great difference in circuitry area is observed. After several iterations in both synthesis and layout generation, a 30% reduction in circuitry area is achieved while maintaining the same fast translation speed by ignoring the constraint of the special register file's data path, which does not affect the translation speed.


Table  2.  Circuit  Summary 


Gate counts                    7967
Power                          Core 18.93 mW, Total 41.7gmW
Area                           500 x 450 µm
Percentage of modules' area    V2P Memory 19.95%, V2P Cache 19.58%, Controllers 0.03%, Probe circuitry 0.2%, Special purpose register file 60.24%

In Figure 6, a layout of the ATU is presented. The purpose of this layout is to form an initial estimate of the overhead of implementing an address translation


This circuitry occupies 500 x 450 µm, which is 2.2% of the DIVA PIM processor area [7-8]. Considering most of the circuitry (60% of standard cells is register files) is not switching during the translation, there is only a 2.9% power increase for the DIVA PIM processor (28.8 mW / 580 mW) to support the address translation mechanism. The overall delay for the ATU is 4.76 ns, which is sufficiently fast to integrate it into the DIVA PIM processor without any extra delay.

5. Conclusion

This paper has presented the design and implementation of the Address Translation Unit to be used in the DIVA PIM processor. An implementation of this design, based on TSMC 0.18µm technology, has proven to be easily integrated into the current DIVA PIM prototype. The ATU is a key component to enable DIVA's memory management design, which is essential for a user-friendly programming paradigm for PIM systems like DIVA.
ABSTRACT

The DIVA (Data IntensiVe Architecture) system incorporates a collection of Processing-In-Memory (PIM) chips as smart-memory co-processors to a conventional microprocessor. We have recently fabricated prototype DIVA PIMs. These chips represent the first smart-memory devices designed to support virtual addressing and capable of executing multiple threads of control. In this paper, we describe the prototype PIM architecture. We emphasize three unique features of DIVA PIMs, namely, the memory interface to the host processor, the 256-bit wide datapaths for exploiting on-chip bandwidth, and the address translation unit. We present detailed simulation results on eight benchmark applications. When just a single PIM chip is used, we achieve an average speedup of 3.3X over host-only execution, due to lower memory stall times and increased fine-grain parallelism. These 1-PIM results suggest that a PIM-based architecture with many such chips yields significantly higher performance than a multiprocessor of a similar scale and at a much reduced hardware cost.

Categories and Subject Descriptors

B.3.2 [Hardware]: Memory Structures; C.1.2 [Computer Systems Organization]: Processor Architectures - Multiple Data Stream Architectures (Multiprocessors)
1. INTRODUCTION

A recent trend in computer architecture combines processing logic with memory in intelligent processing-in-memory (PIM) chips to address the well-known performance gap between processor and memory speeds [2, 7, 8, 12, 15, 16, 17, 20, 21, 22, 25, 27, 28, 30]. Many previous architectural solutions to the processor-memory gap, such as multithreading, prefetching, and speculation, seek to reduce or tolerate memory latency, at the expense of increased memory bandwidth requirements [5]. PIMs instead dramatically improve memory bandwidth, by 10-100X over conventional DRAM systems, because internal processors can be directly connected to the memory banks. Latency to on-chip logic is also reduced, down to less than one half that of a conventional memory system, because internal memory accesses avoid the delays associated with communicating off chip.

For the last four years, the authors have been developing one such PIM-based system called Data IntensiVe Architecture (DIVA). The ultimate goal of the DIVA project is to design and build a prototype workstation-class system where PIMs serve as smart-memory co-processors for an otherwise conventional system. In this paper, we describe the prototype DIVA PIM chip, shown in Figure 1, which we have recently fabricated.


Figure 1: Microphotograph of a DIVA PIM.

DIVA targets two important classes of bandwidth-limited applications: multimedia and irregular applications, including sparse-matrix and pointer computations. Multimedia applications perform repeated computations on streams of data, often with little temporal data reuse. As processors exploit increased parallelism, multimedia applications become memory bound [23]. Performance of applications with irregular data accesses is also dominated by memory stalls since such applications usually have neither temporal nor spatial reuse of data needed to make effective use of cache [3]. DIVA accelerates both classes of applications by performing computation directly in memory, requiring novel underlying hardware structures, described in this paper. Streaming multimedia applications obtain high bandwidth to on-chip memories through a 256-bit wide datapath, while irregular applications benefit from very low latency accesses to memory. As a result, much of the traffic between the host processor and memory is eliminated.

Our experience with the DIVA PIM chip has important implications for future architectures that seek to maximize memory bandwidth. We demonstrate that simple but powerful hardware mechanisms can yield significant performance improvements on bandwidth-limited applications. Many of these hardware features, including address translation, the memory interface and memory-to-memory interconnect, are specifically oriented towards architectures such as DIVA, in which PIMs are smart-memory co-processors to a conventional host. Many other features are suitable for conventional processors and embedded systems-on-a-chip, such as the design of the WideWord unit.

In two previous papers, we presented the DIVA system architecture, memory model and simulated performance improvements due to coarse-grain parallelism in PIMs for 3 programs [9], and we described system software requirements and memory management functionality [10]. This paper focuses on the DIVA PIM device and makes the following unique contributions.

• It is the first detailed description of the DIVA PIM microarchitecture.

• It pinpoints some of the design issues that must be considered in future architectures for exploiting memory bandwidth.

• It presents simulation results demonstrating an average speedup of 3.3X on 8 programs as compared to a conventional host. The speedups are due to up to a 5)G% reduction in memory stall time, and, for 4 of the programs, an average speedup of ‘JJf-lX due to the WideWord unit as compared to scalar PIM execution.

The remainder of the paper is organized as follows. The next section summarizes the DIVA system architecture, to set the context for the PIM microarchitecture discussion. Section 3 describes the microarchitecture in detail. Section 4 presents a set of simulation results on eight programs. Section 5 presents the status of the DIVA project. Section 6 presents related work, and Section 7 concludes the paper.

2. SYSTEM ARCHITECTURE OVERVIEW

The DIVA system architecture was specifically designed to support a smooth migration path for application software by integrating PIMs into conventional systems as seamlessly as possible. DIVA PIMs resemble, at their interfaces, commercial DRAMs, enabling PIM memory to be accessed by host software either as smart-memory co-processors or as conventional memory. In Figure 2, we show a small set of PIMs connected to a host processor through nearly conventional memory control logic (see Section 3.1 for details on required modifications). A separate memory-to-memory interconnect enables communication between memories without involving the host processor.


Figure 2: DIVA system architecture.


Spawning computation, gathering results, synchronizing activity, or simply accessing non-local data is accomplished via parcels. A parcel is closely related to an active message as it is a relatively lightweight communication mechanism containing a reference to a function to be invoked when the parcel is received [20]. Parcels are distinguished from active messages in that the destination of a parcel is an object in memory, not a specific processor.

Parcels are transmitted through a separate PIM-to-PIM interconnect to enable communication without interfering with host-memory traffic. This interconnect must support the dense packing requirement of memory devices and allow the addition or removal of devices from the system. For system sizes of the scale expected for DIVA (on the order of 32 PIM chips), this combination of requirements favors a one-dimensional network [14]. Future generations of DIVA-like systems that contain large numbers of PIM chips will require a more complex interconnection network and are the topic of future research.

Parcels, application code, and data contain virtual addresses. To translate these addresses without the overhead of maintaining conventional page tables at each node, we classify DIVA memory according to usage [9]: (1) global memory visible to the host and PIM nodes; (2) dumb memory allocated as conventional pages in a host application's virtual space and untouched by PIM node processing; and, (3) local memory used exclusively by PIM node routines. To condense translation information, rather than page tables, we use segments, each of which is defined by segment registers, as discussed in Section 3.4. In addition to local segments, a node maintains translation information for its portion of global memory. Remote addresses are translated via the concept of a home node, which is guaranteed to have the translation [20]. Thus, each node's portion of global memory includes objects for which it is the home node. The major advantages of this approach are that translation may be accomplished rapidly, and translation information on each PIM scales well.

Memory management functionality is distributed among
the host's standard operating system, augmented with
support for PIMs, and run-time kernels on PIM processors.
Unlike standard multiprocessor systems, the host, which has a
system-level view, remains a central figure in system-level
scheduling, disk I/O operations, and memory management.
The PIM run-time kernel must collaborate with the host on
system-level operations, such as loading PIM programs and
data, memory management of PIM-visible segments, and
PIM context switches between different user programs. The
challenge in this collaboration is that two views of memory
must be maintained. For dumb pages and for disk I/O
of PIM-visible segments, the host sees memory as standard
4Kbyte pages; the PIM run-time kernel instead views
PIM-visible memory as variable-sized segments [10].

3. DIVA PIM MICROARCHITECTURE

Each DIVA PIM chip is a VLSI memory device augmented
with general-purpose computing and communication
hardware. Although a PIM may consist of multiple nodes, each
consisting primarily of a few megabytes of memory and a
node processor, Figure 3 shows a PIM with a single node,
which reflects the focus of the initial research that is
being conducted. Nodes on a PIM chip share a host interface
and a single PIM Routing Component (PiRC). The host
interface implements the JEDEC standard SDRAM protocol
so that memory accesses as well as parcel activity initiated
by the host appear as conventional memory accesses from
the host perspective. The PiRC is responsible for both
routing parcels off chip in the PIM-to-PIM interconnect and
directing parcels on chip.


Figure 3: DIVA PIM chip architecture.

Figure 3 also shows two interconnects that span a PIM
chip for information flow between nodes, the host interface,
and the PiRC. Each interconnect is distinguished by the
type of information it carries. The PIM memory bus is used
for conventional memory accesses from the host processor.
The parcel interconnect allows parcels to transit between the
host interface, the nodes, and the PiRC. The host interface
also contains a parcel buffer (PBUF) for parcel
communication between host and PIM. Each PIM node also has a
PBUF, for node-to-node parcel communication, as will be
discussed in Section 3.3.

Figure 4 shows the major control and data connections
within a node. The DIVA PIM node processing logic
supports single-issue, in-order execution, with 32-bit
instructions and 32-bit addresses. There are two datapaths whose
actions are coordinated by a single execution control unit:
a 32-bit scalar datapath that performs operations similar to
those of standard 32-bit integer units, and a 256-bit
WideWord datapath that performs fine-grain parallel operations
on 8-, 16-, or 32-bit operands. Both datapaths execute from
a single instruction stream under the direction of a single
5-stage DLX-like pipeline [11], complete with register
forwarding logic to resolve data dependence hazards. This
pipeline fetches instructions from a small instruction cache,
which is included to minimize memory contention between
instruction reads and data accesses. The instruction set has
been designed so both datapaths can, for the most part, use
the same opcodes and condition codes, generating a large
functional overlap. The scalar datapath is a standard RISC
architecture, augmented with a few DIVA-specific functions
for coordinating with the wide datapath. The WideWord
datapath accesses the scalar registers for addressing
operations, as well as for controlling superword operations. Each
datapath has its own independent general-purpose register
file with 32 registers. Special instructions permit direct
transfers between register files without going through
memory. Although not supported in the initial DIVA prototype
shown in Figure 1, floating-point extensions to the
WideWord unit will be provided in future systems.


Figure 4: DIVA PIM node organization.

The execution control unit supports supervisor and user
modes of processing and also maintains a number of
special-purpose and protected registers for support of exception
handling, address translation, and general OS services.
Exceptions, arising from execution of node instructions, and
interrupts, from other sources such as an internal timer or
external interrupt signal, are handled by a common
mechanism. The exception handling scheme for DIVA has a
modest hardware requirement, exporting much of the complexity
to software, to maintain a flexible implementation platform.
It provides an integrated mechanism for handling hardware
and software exception sources.

The following sections present the DIVA PIM node in
more detail and highlight some of the unique features of
the DIVA microarchitecture. The first subsection focuses
on the most distinguishing feature of a PIM as compared to
a conventional processor, its memory unit and memory
interface. Subsequently, we describe DIVA's WideWord unit,
parcel interconnect and address translation mechanism.

3.1 Host Memory Interface and Memory Unit

The host interface and memory unit reflect a number of
the challenges in designing a PIM that serves as a
smart-memory coprocessor to a conventional host. Our underlying
goals were to minimize performance penalties to conventional
memory accesses as viewed by the host, while maximizing
the potential benefit of PIM operations. Although
the original design targeted embedded DRAM, the prototype
shown in Figure 1 is an SRAM-based design due to
challenges in timely access to embedded DRAM fabrication
lines. We first describe the resulting design implemented in
this prototype chip and then present necessary
considerations for a DRAM-based PIM design.

A PIM chip's host interface externally implements the
JEDEC SDRAM protocol so that the PIM appears as a
conventional SDRAM to the host processor. On-chip, the host
interface communicates with an internal memory controller
to negotiate access to the embedded memory. In essence, the
host interface is a translator between the standard SDRAM
protocol and an internal PIM-specific protocol. To satisfy
the timing requirement, this interface must ensure
consistent timing for host memory accesses. At first glance,
this may seem difficult to enforce since the PIM node
processor may be accessing memory when a host memory request
arrives, thereby causing the host access to incur an
additional latency. However, a couple of factors allow the PIM
to respond with predictable latency as required by the
standard. First, the embedded SRAM macro of the prototype
chip has a 4ns cycle time and 256-bit data bus. Secondly,
the internal clock of the PIM is at least twice that of the
SDRAM bus (4X for some implementations), so the addition
of an arbitration cycle is negligible to the overall memory
latency. Refer to Figure 5, which shows a timing diagram for
a 3-cycle CAS latency SDRAM burst read operation.

Figure 5: SDRAM burst read timing.

The worst-case read latency occurs if the memory is busy
satisfying a PIM processor request when the host request
arrives. Even in this case, once the CAS strobe of cycle 3 in
the figure has been detected, there is only a 4-PIM-cycle
latency (2 SDRAM cycles) for the read request to be forwarded
from the host interface to the internal node memory
controller, serviced, and returned for output onto the SDRAM
data bus. This allows the PIM to output the first 64-bit data
word in cycle 6, satisfying the SDRAM protocol. Since all
accesses to the embedded memory involve 256 bits of data,
the succeeding 64-bit data words are readily available for the
host interface to output them in cycles 7, 8, and 9. Similar
timing applies for write accesses.
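
The cycle arithmetic in this argument can be checked with a
short sketch. It is illustrative only: the function name and
its parameters are our own, and the defaults simply restate
the numbers quoted for Figure 5 (CAS strobe in cycle 3,
3-cycle CAS latency, four PIM cycles of internal latency, a
PIM clock at twice the SDRAM clock, 256-bit accesses on a
64-bit bus).

```python
def burst_read_schedule(cas_cycle=3, cas_latency=3, pim_cycles=4,
                        clock_ratio=2, access_bits=256, bus_bits=64):
    """Return the SDRAM cycles in which each data word of a burst is driven.

    Raises ValueError if the internal PIM latency does not fit inside the
    CAS latency that the host expects.
    """
    # Internal latency expressed in SDRAM cycles, rounded up.
    internal = -(-pim_cycles // clock_ratio)
    if internal > cas_latency:
        raise ValueError("PIM cannot meet the configured CAS latency")
    first = cas_cycle + cas_latency      # cycle in which the first word is due
    beats = access_bits // bus_bits      # words supplied by one internal access
    return [first + i for i in range(beats)]
```

With the default values the internal latency is 2 SDRAM
cycles, which fits within the 3-cycle CAS latency, and one
256-bit access supplies all four 64-bit words of the burst.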

The internal node memory controller, shown in Figure 4,
consists of two basic components: an arbiter and a memory
control unit. The arbiter performs handshaking between
requesters of memory accesses and determines priority of
competing requests. Requests for accesses, i.e., reads and
writes, may originate from the host interface memory
port, PIM processor instruction cache, and/or memory
stage of the PIM processor pipeline. Arbitration priorities
are formulated as follows: 1) host interface, 2) PIM
processor memory stage, and 3) PIM processor instruction cache.
The host interface has the highest priority since minimal
latency is required for the PIM chip to appear as conventional
SDRAM for host processor accesses. The arbiter
communicates closely with the memory control unit, which is
responsible for generating all control signals to the memory
array, such as macro select, write enable, output enable, and
address bits. Once a requester has been granted access, the
requesting source drives the data bus and associated
byte-write-enable signals for write accesses while the memory
control unit drives the control signals. For read accesses, the
requester simply latches data returned from the memory at
the appropriate time.
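
The fixed-priority arbitration just described can be sketched
as follows. This is a minimal model under our own
assumptions: the requester names are invented labels, and the
handshaking and control-signal generation are not modeled.

```python
# Requester names are illustrative; a lower number means higher priority.
PRIORITY = {"host_interface": 1, "pim_memory_stage": 2, "pim_icache": 3}

def arbitrate(requests):
    """Pick the winning requester from an iterable of pending requester names.

    Unknown names are ignored. Returns None when nothing is pending.
    """
    pending = [r for r in requests if r in PRIORITY]
    if not pending:
        return None
    return min(pending, key=PRIORITY.__getitem__)
```

A host request always wins, which is what lets the chip meet
SDRAM timing regardless of what the node processor is doing.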

For future DRAM-based implementations, the PIM chip
memory system must be augmented to support refresh
operations and page-mode accesses. For refresh operations,
the host interface must be able to translate system memory
controller refresh operations into internal refresh operations,
which is a fairly straightforward exercise. To exploit
page-mode accesses, the node memory controller should maintain
a current page address register. For normal read/write
accesses, the address presented with the request is compared
against the contents of the current page address register,
assuming that a page is currently open. If the portion of
the requesting address which designates the DRAM page
matches the value of the current page address register, the
access is performed as a page-mode access. If the two values
are unequal, a random access must be performed, which
entails restoring the currently open page and strobing in the
new page corresponding to the access request.
Simultaneously with this access, the new page address is latched into
the current page address register. Also, the current page
address register is invalidated upon refresh operations, since
refresh operations corrupt the values retained in the DRAM
sense amps, which represent the currently open page.
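
The current page address register logic can be sketched as a
small state machine. The class below is a hypothetical model,
not the DIVA controller: the 2048-bit page and 256-bit access
width are assumed defaults, and addresses are counted in
256-bit words.

```python
class PageModeController:
    """Tracks the open DRAM page with a current page address register."""

    def __init__(self, page_bits=2048, word_bits=256):
        # Number of accesses that fit in one page (assumed geometry).
        self.words_per_page = page_bits // word_bits
        self.current_page = None      # None models an invalidated register

    def access(self, word_address):
        """Return 'page' for a page-mode hit, 'random' otherwise."""
        page = word_address // self.words_per_page
        if self.current_page is not None and page == self.current_page:
            return "page"
        # Miss: restore the open page (if any), strobe in the new one,
        # and latch its address into the current page address register.
        self.current_page = page
        return "random"

    def refresh(self):
        """Refresh corrupts the sense amps, so no page is considered open."""
        self.current_page = None
```

The first access after a refresh is always a random access,
even when it targets the page that was open before.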

Also, DRAM-based PIM implementations must carefully
consider the SDRAM interface requirements. As an
example, consider an implementation based on the DRAM macro
provided by the IBM Cu-11 process [13]. Like the SRAM
macro used in the first DIVA prototype chip, this macro
supports a 256-bit data bus with byte-write-enable signals to
support writes of data smaller than 256 bits, where needed.
The macro page size is 2048 bits. The page-mode cycle time
is 5ns, while the random-mode cycle time is 10ns.

If the system memory controller always initiates full burst
requests, like the one shown in Figure 5, small modifications
can be made to the internal PIM logic to satisfy the SDRAM
timing requirements. As soon as a system memory controller
RAS cycle is detected, the host interface should alert the
internal memory controller to finish its current memory
operation and remain idle in anticipation of a host request. Even
with the current highest-speed SDRAM standard, 133MHz,
there is a 30ns latency between the RAS cycle and when data
is required for read operations for a 3-cycle CAS latency
implementation. Based on the random-mode cycle times of the
IBM DRAM macro mentioned above, this is adequate time
for the internal node memory controller to complete its
current memory cycle and respond to a host-initiated memory
cycle. For systems with intelligent memory controllers that
perform page-mode accesses (initiating CAS-only memory
operations), the CAS latency must be configured to support
the maximum PIM latency. The resulting memory latency
penalty is highly system-dependent in this case.


3.2 WideWord Unit

The WideWord unit operates on 256-bit words, enabling
applications to exploit fine-grain, or superword-level,
parallelism and the increased processor-memory bandwidth
available in a PIM node. The WideWord unit has the ability to
change operand width on a per-instruction basis, enabling it
to treat a WideWord operand as a packed array of objects of
8, 16, or 32 bits in size. With the exception of a few
specialized instructions, this characteristic means the WideWord
ALU is more generally represented as a number of parallel
ALUs, where the number depends on operand size.
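
To make the packed-array view concrete, the following sketch
models a 256-bit word as a Python integer and performs a
field-wise add at a chosen operand width. The helper names
are our own, and placing field 0 in the least significant
bits is an assumption of the sketch rather than a documented
DIVA convention.

```python
WORD_BITS = 256

def unpack(word, width):
    """Split a 256-bit integer into fields of `width` bits, field 0 first."""
    if width not in (8, 16, 32):
        raise ValueError("operand width must be 8, 16, or 32 bits")
    mask = (1 << width) - 1
    return [(word >> (i * width)) & mask for i in range(WORD_BITS // width)]

def pack(fields, width):
    """Inverse of unpack: build a 256-bit integer from a list of fields."""
    mask = (1 << width) - 1
    word = 0
    for i, value in enumerate(fields):
        word |= (value & mask) << (i * width)
    return word

def wide_add(a, b, width):
    """Field-wise add with wraparound, one 'parallel ALU' per field."""
    mask = (1 << width) - 1
    sums = [(x + y) & mask for x, y in zip(unpack(a, width), unpack(b, width))]
    return pack(sums, width)
```

The same two 256-bit operands behave as 32, 16, or 8
independent ALU inputs depending on the width argument, which
mirrors the per-instruction operand width described above.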

Besides conventional arithmetic and logic operations, the
WideWord unit also supports a rich set of operations for
manipulating data, including rearrangement of data within a
WideWord operand, transfers between WideWord and scalar
registers and packing and unpacking operations.
Furthermore, the WideWord unit supports selective execution of
instructions on a per-datapath basis, depending on the state
of condition codes. The generality of these three features,
as well as the ability to access main memory at very low
latency, distinguish the DIVA WideWord capabilities from
multimedia ISA extensions such as Intel SSE2 and PowerPC
AltiVec, as well as subword parallelism approaches such as
MAX [15]. We now discuss each of these in detail, and
show examples of their use. In the examples, we use the
convention that WideWord instructions and references to
WideWord registers are both prepended with a 'w'.

Permutation. To rapidly align and reorganize data in
WideWord registers, the WideWord unit has a permutation
functional unit, which enables any 8-bit field of the source
register to be moved into any 8-bit field of the destination
register. A permutation is specified by a permutation
vector, which contains 32 indices corresponding to the 32 8-bit
subfields of a WideWord destination register, where each
index selects which subfield of the source data is moved into
that destination field. General permutations are specified
such as wprm wro,wri,wrp, where wrp specifies the desired
permutation vector to be applied to input wri, with the
output in wro. Wrp is either constructed through a
series of instructions or is loaded from memory. To bypass
the cost of constructing or loading general permutations,
commonly used permutations are instead specified such as
wprmi wro,wri,sr, where sr is a scalar register that
contains an index into a table of hard-wired permutations, such
as shifts, rotates, shuffles, gathers, scatters and reductions.

Figure 6 illustrates the use of permute operations with
an example of a reduction sum of the elements of an array
(loaded into wr1). The reduction sum is performed in log(n)
steps, where n is the number of elements in wr1 (in the
examples, n = 4, for simplicity). On each step, the first
permutation swaps each even-numbered field 2i with its
odd-numbered neighbor field 2i + 1, 0 ≤ i < n/2, storing the
result in wr2. Then the contents of wr1 and wr2 are added,
resulting in the sum of each pair of even-/odd-numbered
elements in each even-numbered field of wr1. Finally, all
even-numbered partial sums are permuted into the lower
half of wr1, reducing the problem size by half. After the
last step, the sum of all elements is in field zero of wr1.


// sr1 refers to permutation (1,0,3,2)
// sr2 refers to permutation (0,2,*,*)

wld   wr1,&array;    // wr1 = (a0,a1,a2,a3)
// step 1:
wprmi wr2,wr1,sr1;   // wr2 = (a1,a0,a3,a2)
wadd  wr1,wr1,wr2;   // wr1 = (a0+a1,a0+a1,a2+a3,a2+a3)
wprmi wr1,wr1,sr2;   // wr1 = (a0+a1,a2+a3,*,*)
// step 2:
wprmi wr2,wr1,sr1;   // wr2 = (a2+a3,a0+a1,*,*)
wadd  wr1,wr1,wr2;   // wr1 = (a0+a1+a2+a3,*,*,*)


Figure 6: Reduction sum using permutations.
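
The swap, add, and gather pattern of Figure 6 can be traced
with a short model. This is a sketch under our own
assumptions: registers are Python lists, don't-care fields
(shown as * in the figure) are filled with 0, and the final
step performs one extra gather that the figure omits because
the result is already in field zero.

```python
def permute(src, vector):
    """Destination field d receives source field vector[d].

    A None index marks a don't-care field; this sketch fills it with 0.
    """
    return [0 if i is None else src[i] for i in vector]

def reduction_sum(values):
    """Sum a power-of-two-length list in log2(n) swap/add/gather steps."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    wr1 = list(values)
    size = n
    while size > 1:
        # Swap each even field 2i with its odd neighbour 2i+1.
        swap = [i ^ 1 if i < size else None for i in range(n)]
        wr2 = permute(wr1, swap)
        wr1 = [a + b for a, b in zip(wr1, wr2)]
        # Gather the even-numbered partial sums into the lower half.
        gather = [2 * i if i < size // 2 else None for i in range(n)]
        wr1 = permute(wr1, gather)
        size //= 2
    return wr1[0]
```

For n = 4 the two permutation vectors generated in the first
step are (1,0,3,2) and (0,2,*,*), matching the hard-wired
permutations that sr1 and sr2 select in the figure.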


Register Transfers. To enable efficient data movement
between scalar and WideWord register files, the WideWord
unit supports a set of transfer instructions. The transfer
instructions include mvswr wr,sr, which replicates the
contents of scalar register sr into all fields of WideWord register
wr, mvsw wr,sr,i, which copies sr only to the immediate i
field of wr, and mvws sr,wr,i, which copies the contents of
the immediate i field of wr to sr.

Figure 7 illustrates the use of transfer instructions in a
mixed irregular-regular computation where it is
advantageous to perform the regular computation in the WideWord
unit and the irregular in the scalar ALU. In this example, the
multiplication of A[k] : A[k+3] by X can be performed in the
WideWord unit, since array A is accessed with stride one.
To allow the vector-scalar multiplication to be performed in
parallel, X is replicated into a WideWord register. It is not
profitable to perform the addition of Y[R[k]] and A[k] * X
in the WideWord unit, since it would be necessary to pack
Y[R[k]] to Y[R[k+3]] into a WideWord register and check
whether R[k] : R[k+3] are distinct values. Nevertheless, the
computation of the addresses (base address of Y plus offsets
R[k] : R[k+3]) can still be performed in parallel, as shown
in the example. Finally, the operands are moved to scalar
registers and the additions are performed in the scalar unit.
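
The reason the final additions stay in the scalar unit can be
seen in a small model of the loop. Both functions below are
our own illustrations, with Python lists standing in for
memory and a default of four 32-bit fields per WideWord
operation.

```python
def update_block(Y, R, A, X, k, lanes=4):
    """Apply Y[R[j]] += A[j] * X for j in k..k+lanes-1.

    Products and target indices are formed `lanes` at a time, as the
    WideWord unit would; the read-modify-write of Y stays sequential.
    """
    products = [A[k + j] * X for j in range(lanes)]   # wide multiply by replicated X
    targets = [R[k + j] for j in range(lanes)]        # wide address computation
    for index, product in zip(targets, products):     # scalar unit, one at a time
        Y[index] += product
    return Y

def naive_wide_update(Y, R, A, X, k, lanes=4):
    """Incorrect when R repeats an index: all lanes read Y before any write."""
    gathered = [Y[R[k + j]] for j in range(lanes)]
    for j in range(lanes):
        Y[R[k + j]] = gathered[j] + A[k + j] * X
    return Y
```

When R contains a repeated index, the naive wide version
loses one of the contributions, which is exactly the
distinctness check the text says would be required.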

// original loop:
// for (k = 0; k < K; k++)
//   Y[R[k]] = Y[R[k]] + A[k] * X
// sr1 = base address of Y (&Y)
// sr2 = address of A[k] (&A[k])
// sr3 = address of X

// compute A[k]*X:A[k+3]*X in WideWord
wld   wr1,sr2;        // wr1 = (A[k]:A[k+3])
ld    sr4,sr3;        // sr4 = X
mvswr wr3,sr4;        // wr3 = (X,X,X,X)
wmul  wr4,wr1,wr3;    // wr4 = (A[k]*X:A[k+3]*X)

// compute addresses &Y[R[k]]:&Y[R[k+3]] in WideWord
mvswr wr1,sr1;        // wr1 = (&Y,&Y,&Y,&Y)
wld   wr2,&R[k];      // wr2 = (R[k]:R[k+3])
wsll  wr2,wr2,2;      // convert to address offset
wadd  wr3,wr1,wr2;    // wr3 = (&Y[R[k]]:&Y[R[k+3]])

// compute Y[R[k]]+A[k]*X in scalar unit
mvws  sr10,wr3,0;     // sr10 = &Y[R[k]]
ld    sr11,sr10,0;    // sr11 = Y[R[k]]
mvws  sr12,wr4,0;     // sr12 = A[k]*X
add   sr13,sr11,sr12; // sr13 = Y[R[k]]+A[k]*X
st    sr10,sr13,0;    // Y[R[k]] = sr13

// compute Y[R[k+1]]+A[k+1]*X in scalar unit


Figure 7: Irregular computation using both WideWord and
scalar ALUs.


Selective execution. Selective execution is supported
for most arithmetic and logic instructions, permutation
instructions and some transfer instructions. Under selective
execution, only the results corresponding to participating
subfields are written back to the destination register
specified in the instruction. Therefore, the implementation of
selective execution requires that writeback enable bits be
associated with each 8-bit subfield of the ALU result, which
complicates the register forwarding logic somewhat. The
determination of whether a subfield participates in the
execution of a given instruction is derived from condition codes,
two special-purpose registers, and a field in the instruction.
One special-purpose register is a user-settable 32-bit mask
register, where each bit corresponds to an 8-bit subfield of
the operation. The participation mode register is a 5-bit
register that specifies the condition for selective execution as a
combination of condition codes and/or mask register. A
2-bit participation field in the instruction specifies one of four
possible extents of selective execution: local participation,
where a subfield participates if its local condition (as derived
from the participation mode and mask register values and
condition codes) is true; leftmost/rightmost participation,
where only the leftmost/rightmost subfield with a
condition that is true participates; and always participate, where
all subfields participate. Although similar designs support
some type of conditional operations, the DIVA WideWord
unit provides a much richer functionality through the ability
to specify selective execution in almost every wide
instruction and the use of global condition code information in
selection decisions.

This distinction is illustrated in Figure 8. The instruction
wsubcc subtracts X from elements of array C and sets the
condition code for each 32-bit field. Then the subsequent
instruction waddlc, where lc specifies local participation,
performs an addition only on those fields for which the GT
condition code is set.


// Original loop:
// for (k = 0; k < N; k++)
//   if (C[k] > X) A[k] = A[k] + B[k]

// set participation mode register (PM)
ori    r2, r0, "GT"
mtspr  PM, r2

wld    wr1, &A;        // wr1 = (a0,a1,a2,a3)
wld    wr2, &B;        // wr2 = (b0,b1,b2,b3)
wld    wr3, &C;        // wr3 = (c0,c1,c2,c3)
ld     r1, &X;         // r1 = X
wmvswr wr4,r1;         // wr4 = (X,X,X,X)
wsubcc wr5,wr3,wr4;    // wr5 = (c0-X,c1-X,c2-X,c3-X)
waddlc wr1, wr1, wr2;  // if (C > X) A = A+B


Figure  8:  Selective  update  example. 
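
The four participation extents can be modeled in a few lines.
The sketch below is illustrative and makes two assumptions of
its own: field 0 is treated as the leftmost field, and the GT
condition is computed directly from the subtraction result
rather than from hardware condition codes.

```python
def participating(conditions, mode):
    """Return a per-field list of booleans for one participation mode.

    mode is one of 'local', 'leftmost', 'rightmost', 'always'.
    """
    n = len(conditions)
    if mode == "always":
        return [True] * n
    if mode == "local":
        return [bool(c) for c in conditions]
    if mode not in ("leftmost", "rightmost"):
        raise ValueError("unknown participation mode")
    true_fields = [i for i, c in enumerate(conditions) if c]
    if not true_fields:
        return [False] * n
    chosen = true_fields[0] if mode == "leftmost" else true_fields[-1]
    return [i == chosen for i in range(n)]

def selective_add(a, b, c, x, mode="local"):
    """if (c[i] > x) a[i] = a[i] + b[i], gated by the participation mode."""
    gt = [ci - x > 0 for ci in c]        # condition codes from the subtract
    enable = participating(gt, mode)     # per-field writeback enables
    return [ai + bi if en else ai for ai, bi, en in zip(a, b, enable)]
```

With mode "local" this reproduces the behavior of the
wsubcc/waddlc pair in Figure 8: fields whose condition is
false keep their old value because their writeback enable is
not asserted.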


3.3 Parcel Interconnect

Even for applications where the WideWord instructions
are not applicable, the WideWord datapath is used to
accelerate all parcel communication, as will be discussed here. As
described earlier, the PIM Routing Component (PiRC) not
only implements the PIM-to-PIM interconnect but also
interacts with parcel buffers (PBUFs), the basic on-chip
hardware mechanisms supporting parcels. The PBUF has a
virtual as well as a physical abstraction. To the application, the
PBUF locations appear as regular memory locations that are
manipulated through simple loads and stores. At a physical
level the PBUF is a set of memory-mapped registers. Each
PIM node contains a PBUF that serves as a port between
the on-chip parcel interconnect and the node. Although a
PIM node's PBUF could be implemented as special-purpose
registers, a memory-mapped mechanism allows a uniform
implementation for both node and host PBUF. The PBUF
within the PIM chip host interface is memory-mapped into
the host processor's address space to permit host and PIM
parcel communication.

A parcel consists of a 64-bit header and 256-bit payload.
Most of the parcel contents are written by the user program
during a parcel launch; however, the system is responsible
for generating some fields such as PiRC routing information,
source node ID, process identifier, and interrupt status. The
user program is responsible for specifying header fields that
include the virtual address of the object to which the parcel
is directed and a specification of the command to execute on
that object. In addition, the user program specifies the
256-bit payload, which consists of arguments for the command
task or other data associated with the action specified by
the parcel.

Data is written to or read from the PBUF in 256-bit
increments via the WideWord registers. The PBUF address
space can then be viewed as a set of 256-bit registers.
Besides the header and payload registers, there are also status
and configuration registers. Although the payload is the
only true physical 256-bit register, each register is allocated
256 bits of the address space and is aligned to the least
significant bit boundary. At least two register sets are needed,
one for sending and one for receiving. In addition, it is
desirable to have multiple address mappings (aliases) of these
sets to support different access privileges and modes, such
as non-launching and launching writes to the send registers,
destructive and non-destructive reads from the receive
registers, and interrupt capability. The DIVA design includes
several aliases to support such mechanisms.
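
The effect of these aliases can be sketched with two small
classes. Everything here is a hypothetical model: the
register names, the use of a keyword argument to stand in
for a distinct alias address, and the unbounded receive queue
are our assumptions, not the DIVA register map.

```python
class SendPBUF:
    """Send-side parcel buffer with non-launching and launching write aliases."""

    def __init__(self):
        self.header = 0
        self.payload = 0
        self.launched = []            # stands in for the parcel interconnect

    def write(self, register, value, launch=False):
        """Store to 'header' or 'payload'; a launching alias also sends."""
        if register not in ("header", "payload"):
            raise KeyError(register)
        setattr(self, register, value)
        if launch:
            self.launched.append((self.header, self.payload))


class ReceivePBUF:
    """Receive-side buffer with destructive and non-destructive read aliases."""

    def __init__(self):
        self.queue = []

    def deliver(self, parcel):
        """Called by the interconnect model when a parcel arrives."""
        self.queue.append(parcel)

    def read(self, destructive=False):
        """Return the oldest parcel, or None; a destructive read consumes it."""
        if not self.queue:
            return None
        return self.queue.pop(0) if destructive else self.queue[0]
```

A program would fill the payload through the non-launching
alias and then write the last field through the launching
alias, so that a single store both completes and sends the
parcel.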

3.4 Address Translation Hardware

The primary functions of the node address translation
unit are to translate virtual addresses to physical addresses
for those accesses which are locally resident and to
provide access protection. The types of accesses generated by
a DIVA PIM processor that require translation include
instruction fetches and data accesses to memory or
memory-mapped devices such as parcel buffers, generated by load
or store instructions. Given the simplicity of the
segment-based address translation scheme discussed in Section 2,
very little hardware support is needed to effect efficient
translation. The necessary descriptors for a local memory
segment are a physical base address register, offset limit
register, and access privilege control bits. For global memory
segments, an additional virtual base address register is
useful to effect efficient translation, as described below. The
initial DIVA architecture provides eight sets of local
segment registers and four sets of global segment registers. If
an application requires a number of segments that is more
than that supported by the translation hardware, the PIM
run-time kernel must manage the configuration of the
translation hardware to minimize address faults. Like pages in
a conventional system, segments and their associated
descriptors are generic in nature. It is only through system
programming that a segment serves a specific purpose, such
as representing user code or data segments.

To distinguish between local and global segments, we arbitrarily, but
with little loss of generality, specify that the upper 5 bits of a
virtual address generated by a PIM processor indicate the scope of
the address. The value of the scope field determines what type of
translation, if any, is used (see Figure 9). For local translation,
bits 5 through 7 are used as an index value to select one of eight
sets of local segment descriptors for translation and protection
checking. The rest of the virtual address represents an offset from
the segment base address. Unlike the table look-up style of local
translation, for global translation it is more efficient to determine
if the virtual address is contained within the span of a global
segment. Thus if the scope value indicates global translation, a
fully-associative lookup is performed using the global segment
descriptors. Also, as shown in the figure, a supervisor-level
untranslated region that spans the exception handler addresses has
been reserved. This feature is useful for kernel code to run
diagnostics such as verifying the operation of the address
translation hardware without being incapacitated by related hardware
errors.
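
The two translation styles described above can be sketched in a few
lines of Python. This is only an illustration under stated
assumptions: the text gives the field widths (a 5-bit scope field,
then a 3-bit local segment index) but not the scope encodings, so the
constants SCOPE_UNTRANSLATED, SCOPE_LOCAL and SCOPE_GLOBAL are
hypothetical, as are the class and function names, the 32-bit address
width, and the choice to treat the limit as an exclusive bound.

```python
from dataclasses import dataclass
from typing import List, Optional

ADDR_BITS = 32                                        # assumed address width
SCOPE_BITS = 5                                        # upper 5 bits: scope
INDEX_BITS = 3                                        # next 3 bits: local index
OFFSET_BITS = ADDR_BITS - SCOPE_BITS - INDEX_BITS     # remaining offset bits

# Hypothetical scope encodings; the real values are not given in the text.
SCOPE_UNTRANSLATED = 0
SCOPE_LOCAL = 1
SCOPE_GLOBAL = 2


class AddressFault(Exception):
    """Raised when no descriptor maps the address or a check fails."""


@dataclass
class LocalSegment:
    phys_base: int          # physical base address register
    limit: int              # offset limit (assumed exclusive)
    writable: bool = True   # stand-in for the access privilege bits


@dataclass
class GlobalSegment:
    virt_base: int          # additional virtual base address register
    phys_base: int
    limit: int
    writable: bool = True


def translate(vaddr: int,
              local_segs: List[Optional[LocalSegment]],
              global_segs: List[GlobalSegment],
              write: bool = False,
              supervisor: bool = False) -> int:
    scope = vaddr >> (ADDR_BITS - SCOPE_BITS)
    if scope == SCOPE_UNTRANSLATED:
        # Supervisor-level region: the address passes through unchanged.
        if not supervisor:
            raise AddressFault("untranslated region is supervisor-only")
        return vaddr
    if scope == SCOPE_LOCAL:
        # Table look-up: the index field selects one of eight descriptors.
        index = (vaddr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
        offset = vaddr & ((1 << OFFSET_BITS) - 1)
        seg = local_segs[index]
        if seg is None or offset >= seg.limit:
            raise AddressFault("local segment miss or limit violation")
        if write and not seg.writable:
            raise AddressFault("write to read-only local segment")
        return seg.phys_base + offset
    if scope == SCOPE_GLOBAL:
        # Fully-associative lookup: test every global descriptor's span.
        for seg in global_segs:
            if seg.virt_base <= vaddr < seg.virt_base + seg.limit:
                if write and not seg.writable:
                    raise AddressFault("write to read-only global segment")
                return seg.phys_base + (vaddr - seg.virt_base)
        raise AddressFault("no global segment spans this address")
    raise AddressFault("unknown scope value")
```

In hardware the global comparison would be done for all descriptors
in parallel; the loop here only models the outcome. A fault raised by
this sketch corresponds to the address faults that the run-time
kernel would resolve by reconfiguring the segment registers.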


Figure 9: Address translation in DIVA PIMs.


4.  EXPERIMENTAL RESULTS

4.1  Applications 

To measure the performance potential of the architecture, we examine
in detail eight benchmark applications, summarized in Table 1. These
applications span a broad range of domains including scientific
computing, databases and image processing. They exhibit both
coarse-grain parallelism (which allows computation to be spread
across PIMs) and, in some cases, fine-grain parallelism (which can be
exploited through execution in the WideWord unit). CG, Neighborhood,
Pointer, OO7 and Natural Join exhibit irregular or mixed (regular and
irregular) data access patterns, resulting in high memory access
overheads on conventional architectures. Cornerturn, Transitive
Closure and Template Matching are dense matrix computations with
regular access patterns, but memory bandwidth becomes a limiting
factor in exploiting available parallelism. These three and CG rely
on the WideWord unit to exploit parallelism and PIM bandwidths.
Hereon, we use abbreviations for each of the program names with a
suffix -H for host and -P for PIM.

4.2  Simulation Environment and Parameters

To evaluate the DIVA architecture, we developed a system simulator
called DSIM, which uses RSIM as a framework, with significant
extensions [24]. RSIM is an event-driven simulator that models
shared-memory multiprocessors built with state-of-the-art multiple
issue, out-of-order superscalar processors. DSIM extensions include a
simpler PIM processor with a WideWord unit, the DIVA memory system,
the parcel communication mechanism and the PIM-to-PIM interconnect.
DSIM supports the full DIVA PIM ISA.

The DSIM host processor is taken directly from RSIM, including the
first and second-level caches. The host processor architecture is
based on the MIPS R10000 and is configured as a four-issue processor
with two integer arithmetic units, two floating-point units and one
address unit. Loads are non-blocking. It has a [illegible]-Kbyte L1
and a 1-Mbyte L2 cache, both two-way associative, with access times
of 1 and 10 cycles, respectively. Both L1 and L2 caches are pipelined
and support multiple outstanding requests.

The host is connected to the DIVA memory system via a
split-transaction, [illegible]-bit bus. The memory system consists of
the aggregation of all PIM memories, where each local memory is
visible from both host and local PIM processor. DSIM maintains the
current open row of each memory bank to determine the memory access
type (page or random mode) and simulates arbitration between host and
PIM accesses, as described in Section 3.1. The memory latencies seen
by the host are 32 cycles for page-mode accesses and [illegible]
cycles for random mode, and include the bus transfer delay, the
memory arbitration time and the DRAM access time (4 and 12 cycles for
page and random mode, respectively). The memory latencies seen by the
local PIM processor, including arbitration and DRAM access times, are
5 and 13 cycles for page and random mode accesses, respectively.

An application library supports a cache-line flush to enforce
coherence between the host caches and PIM memory, as well as
synchronization and communication functions. These functions are
linked with the application, and their execution is simulated by DSIM
in the same way as the application code. DSIM also models the parcel
mechanism and the PIM-to-PIM interconnect in detail, but we omit
further description since this paper focuses on 1-PIM performance.
For those experiments, we make the conservative assumption that the
PIM processor runs at half the speed of the host processor. Although
the inherent speed of the logic is no slower [illegible], we make
this assumption because the subcomponents of the PIM processing logic
run in lock-step, so the resulting clock speed is slower than that of
superscalar schemes.
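
The open-row bookkeeping that this section attributes to DSIM can be
illustrated with a small model. The sketch below is not DSIM code: it
only reuses the PIM-side latencies quoted in this section (5 cycles
for page mode, 13 for random mode), while the row size, the bank
count, the address-to-bank mapping and all names are assumptions made
for illustration.

```python
PAGE_MODE_CYCLES = 5      # PIM-side page-mode latency quoted in the text
RANDOM_MODE_CYCLES = 13   # PIM-side random-mode latency quoted in the text
ROW_BYTES = 2048          # hypothetical DRAM row size
NUM_BANKS = 4             # hypothetical number of banks


class OpenRowModel:
    """Tracks the currently open row of each bank, one entry per bank."""

    def __init__(self, num_banks=NUM_BANKS, row_bytes=ROW_BYTES):
        self.num_banks = num_banks
        self.row_bytes = row_bytes
        self.open_row = [None] * num_banks

    def access(self, addr):
        """Return the latency of one access and update the open row."""
        global_row = addr // self.row_bytes
        bank = global_row % self.num_banks    # assumed row interleaving
        row = global_row // self.num_banks
        if self.open_row[bank] == row:
            return PAGE_MODE_CYCLES           # hit in the open row
        self.open_row[bank] = row             # open the new row
        return RANDOM_MODE_CYCLES


def average_latency(model, addresses):
    """Average latency over a non-empty sequence of addresses."""
    latencies = [model.access(a) for a in addresses]
    return sum(latencies) / len(latencies)
```

With this model, a unit-stride stream pays the random-mode latency
only when it crosses into a new row, whereas a stride that always
lands on a different row of the same bank pays it on every access.
That contrast is the reason the tuned programs in Section 4.5 reorder
and group their streaming accesses.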

4.3  Performance Compared Against Host
Figure 10 summarizes 1-PIM performance as compared to execution on
the conventional host processor. Five of the eight programs speed up
significantly compared against host execution, two remain about the
same, and one program is slowed down. (All programs speed up when
multiple PIMs are used.) Overall, the average speedup is 3.3X.
Several factors contribute to these speedups, including the lower
memory stall times on the PIM nodes and the benefits of the WideWord
unit in exploiting fine-grain parallelism and taking advantage of
page-mode memory accesses. The remainder of this section examines
these factors in detail.

Table 1: Application description.

Figure 10: Speedups over host-only execution.


4.4  Reduction in Memory Stall Time

Figure 11(a) shows the memory stall times of host-only execution.
PIMs reduce memory stall time in two ways: (1) lower latency to
memory; and, (2) higher bandwidth to memory through wide loads and
stores. (A third reduction occurs as a result of coarse-grain
parallelism across the PIMs, which is not discussed in this paper.)
We see from the figure that five of the eight programs spend more
than 40% of their time stalled in memory accesses. DIVA achieves a
reduction in memory stall time for these five programs ranging from
13.[illegible]% for Natural Join to 35% for Cornerturn, as shown in
Figure 11(b).

The host version of Template Matching (TM-H) has a memory stall time
of only [illegible]% of its total execution time. The data set size
of this application fits in the L2 cache, and the working set of each
loop fits in the L1 cache; therefore, the data reuse exhibited by TM
is effectively exploited. Even though TM-H does not suffer from large
memory stall times, the 1-PIM version (TM-P) has even smaller stall
times due to the high data bandwidth at the PIM node. The use of the
WideWord unit for loading/storing and operating on 256-bit objects,
plus the reuse of data in WideWord registers, reduces the memory
stall time to 20% of that of TM-H.

Cornerturn has a memory stall time of 90.17% when running on the
host. This application has very little temporal reuse, since each
matrix element is accessed only twice (one read and one write) during
the matrix transpose. Thus primarily spatial reuse is exploited in
cache, and each new cache line is only reused a few times. In the PIM
version, the WideWord datapaths also exploit the available spatial
reuse. Furthermore, the WideWord loads/stores and operations on 8
matrix elements at a time also reduce the number of accesses to
memory. Finally, the latency seen by the PIM processor (average of
11.57 cycles, since most of the accesses are in random mode) is much
lower than that suffered by the host. The combination of these
factors reduces the CT-P memory stall time to 4.32% of that of CT-H.

CG also benefits from the lower memory latencies on the PIM node.
Since the data set size does not fit in the host caches and the
irregular access patterns cause conflict misses, CG-H spends
[illegible]% of its execution time stalled due to cache misses.
Although most of the misses are satisfied at the L2 cache (51.32%),
[illegible]% of the stall time is due to accesses to the DRAM. On the
PIM, [illegible]% of the memory accesses are page-mode accesses, and
the average latency seen by the processor is only 5.91 cycles.

Transitive Closure on the host, TC-H, spends [illegible]% of its
execution time stalled due to cache misses, with 47.14% of the misses
satisfied at the L1 and 52.81% satisfied at the L2, resulting in an
average miss latency of [illegible] cycles. For TC-P, the average
memory latency is 5.57 cycles, due to [illegible]% of page-mode
accesses. In addition to lower memory latencies, TC-P also has a
smaller number of memory accesses since the WideWord unit is used to
transfer the data to/from memory and perform the computation. The use
of the WideWord unit results in the added benefit of exploiting
spatial reuse, since the matrix is accessed with stride one in the
[illegible] dimension.

Neighborhood shows an increase in memory stall time because the data
fits in cache, and thus the memory latency at the PIM is larger than
that of the host. The increase in memory stall time plus the fact
that the PIM processor runs at half the speed of the host result in a
slowdown with respect to host-only execution.

Pointer has no spatial reuse and little temporal reuse, and since the
data set size is larger than the L2 cache, P-H stalls for memory for
[illegible]% of its execution time, with most misses satisfied at the
DRAM. P-P has roughly the same number of loads and stores, but the
average latency seen by the PIM is much smaller than the memory
latency suffered by the host, even though most of the PIM accesses
are random-mode accesses.

Natural Join exhibits little temporal reuse and high cache miss
rates, even though the data set size fits in the L2 cache. NJ-P shows
a reduction of 13.[illegible]% in memory stall times due to the lower
average latency seen by the PIM processor. OO7 also has almost no
temporal reuse and OO7-H suffers from a large amount of cache misses.
On the PIM version the memory stall time is reduced by [illegible],
again as a result of the smaller on-chip latency.

4.5  Benefits from WideWord and Page Mode

To isolate the benefit of the WideWord unit, we compare scalar
versions against versions tuned to take advantage of the WideWord
unit and page-mode memory accesses for the four programs that utilize
the wide datapaths. These results are shown in Figure 12. Speedups
are significant, ranging from [illegible]X for CG up to 17.96X for
TM, with an average improvement of 9.[illegible]3X. The features of
the instruction set that are exploited are summarized in the final
column of Table 1, and described as follows.

TM computes three correlation values between an image and each of 32
templates, each correlation corresponding to a loop nest. The DIVA
implementation, which is described in detail in [citation illegible],
takes advantage of the inherent fine-grain parallelism by operating
on 32 8-bit image pixels and 32 8-bit template elements at a time.
Since a template is represented as a 32-by-32 matrix of 8-bit
elements, an entire template row fits into one WideWord register.
Also, since the innermost loop traverses one template row, the entire
inner loop computation is transformed into a sequence of WideWord
operations on one template row and 32 pixels of an image row,
effectively eliminating the innermost loop. The accumulation of the
pixel values is achieved by a parallel reduction sum, using
permutation operations as in Figure 6, and the result of the
reduction sum is added to the correlation value using selective
execution as in Figure 8. To exploit temporal reuse in WideWord
registers, we applied common loop transformations, particularly
unroll-and-jam [4]. In addition, we exploited spatial reuse by
shifting an image sub-row held in a WideWord register by one pixel,
to move the window of the image to be compared against the template.
Further performance improvements are obtained by reordering memory
accesses and grouping streaming accesses to the dense arrays to
achieve page-mode memory access latencies.
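
A parallel reduction sum built from permutations and wide adds can be
sketched as follows. The exact permutation pattern of Figure 6 is not
reproduced in this section, so the butterfly exchange used here (lane
i paired with lane i XOR stride) is an assumption, as are the helper
names. Python lists stand in for WideWord registers, and the sketch
ignores the lane-width overflow that real 8-bit fields would have to
handle.

```python
def permute(vec, perm):
    """General permutation: output lane i takes input lane perm[i]."""
    return [vec[p] for p in perm]


def wide_add(a, b):
    """Lane-wise addition of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def reduction_sum(vec):
    """Sum all lanes in log2(n) permute-and-add steps.

    After the final step every lane holds the total, so any single
    lane can be selected as the result.
    """
    n = len(vec)
    if n == 0 or (n & (n - 1)) != 0:
        raise ValueError("lane count must be a power of two")
    acc = list(vec)
    stride = n // 2
    while stride >= 1:
        # Butterfly exchange: pair each lane with its partner.
        perm = [i ^ stride for i in range(n)]
        acc = wide_add(acc, permute(acc, perm))
        stride //= 2
    return acc
```

For 32 lanes this takes five permute/add pairs instead of 31
sequential additions, which is the kind of saving the wide datapath
is meant to provide for the TM inner loop.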

(a) Busy and memory stall times for host-only execution.

(b) Host-only and 1-PIM memory stall times.

Figure 11: Memory stall times.

Figure 12: Speedup of WideWord vs. scalar.

The CT implementation performs a hierarchical in-place matrix
transpose where the smallest submatrices, of size 8x8, are transposed
in WideWord registers. Each 8x8 submatrix is loaded into the WideWord
register file (an 8x8 matrix with 32-bit elements requiring 8
WideWord registers), and transposed via a sequence of permutation
operations. The transposed submatrix is then stored back in memory.
This implementation takes advantage of the large capacity of the
WideWord register file, avoiding loads and stores to memory during
the transpose of each 8x8 submatrix.
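
The blocked structure of this transpose can be sketched as below.
This is a simplified, single-level version rather than the
hierarchical scheme used for CT, and the in-register permutation
sequence is replaced by an ordinary Python transpose of the 8x8
block. All function names are illustrative, and holding two blocks at
once assumes a register file of at least 16 WideWord registers.

```python
BLOCK = 8  # 8x8 submatrix of 32-bit elements = 8 WideWord registers


def load_block(m, r, c):
    """Load an 8x8 block: one row slice per (simulated) register."""
    return [m[r + i][c:c + BLOCK] for i in range(BLOCK)]


def store_block(m, r, c, block):
    """Store an 8x8 block back to the matrix."""
    for i in range(BLOCK):
        m[r + i][c:c + BLOCK] = block[i]


def transpose_in_registers(block):
    """Stand-in for the permutation sequence applied in registers."""
    return [list(col) for col in zip(*block)]


def blocked_transpose_in_place(m):
    """Transpose a square matrix in place, one pair of blocks at a time."""
    n = len(m)
    if n % BLOCK != 0 or any(len(row) != n for row in m):
        raise ValueError("matrix must be square with size a multiple of 8")
    for r in range(0, n, BLOCK):
        for c in range(r, n, BLOCK):
            a = transpose_in_registers(load_block(m, r, c))
            if r == c:
                store_block(m, r, c, a)      # diagonal block
            else:
                b = transpose_in_registers(load_block(m, c, r))
                store_block(m, r, c, b)      # swap the mirrored blocks
                store_block(m, c, r, a)
```

Each element is read once and written once, and all intermediate
shuffling happens on the loaded blocks, which mirrors the point made
above about avoiding memory traffic during the transpose of each
submatrix.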

CG's key computation is a sparse matrix-vector multiply. Due to the
mixed regular/irregular nature of data accesses, we only exploit
fine-grain parallelism in the WideWord unit for the regular portions
of the computation. The dense vector accesses are loaded into
WideWord registers, and the dense vector multiplies are performed in
the WideWord floating-point unit. The accumulates into the sparse
matrix are performed sequentially. Selective execution is used to
select the field of the WideWord operand that participates in the
operation. As in TM, we also reordered memory accesses to achieve
page-mode latencies on the dense arrays.

TC uses a dense matrix to represent the distance graph. It exploits
fine-grain parallelism by performing WideWord arithmetic operations
on eight 32-bit elements of the matrix that are held in WideWord
registers. Selective execution using a WideWord merge operation
merges the contents of two WideWord registers according to
condition-code bits, allowing an efficient computation of the minimum
value of each pair of elements of two WideWord operands. Similar to
TM, we use unroll-and-jam to obtain temporal reuse in the WideWord
register file.
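
The compare-then-merge idiom for a lane-wise minimum can be
illustrated as follows. The sketch assumes that the TC kernel has the
familiar Floyd-Warshall shape over a dense distance matrix; the text
does not spell out the loop structure, so that shape, the helper
names, and the use of Python lists of 8 values as WideWord registers
are all assumptions.

```python
LANES = 8  # eight 32-bit elements per WideWord register


def wide_compare_lt(a, b):
    """Set one condition-code bit per lane where a < b."""
    return [x < y for x, y in zip(a, b)]


def wide_merge(a, b, cc):
    """Take the lane from a where its condition bit is set, else from b."""
    return [x if bit else y for x, y, bit in zip(a, b, cc)]


def wide_min(a, b):
    """Lane-wise minimum expressed as compare followed by merge."""
    return wide_merge(a, b, wide_compare_lt(a, b))


def wide_add_scalar(a, s):
    """Add a scalar to every lane."""
    return [x + s for x in a]


def transitive_closure(dist):
    """Floyd-Warshall style update, eight row elements at a time."""
    n = len(dist)
    if n % LANES != 0:
        raise ValueError("matrix size must be a multiple of 8")
    for k in range(n):
        for i in range(n):
            d_ik = dist[i][k]
            for j in range(0, n, LANES):
                current = dist[i][j:j + LANES]
                via_k = wide_add_scalar(dist[k][j:j + LANES], d_ik)
                dist[i][j:j + LANES] = wide_min(current, via_k)
    return dist
```

The minimum needs no branch per element: the comparison produces a
mask and the merge consumes it, so the whole group of eight elements
is handled by two wide operations.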

5.  STATUS

The first DIVA PIM prototype, shown in Figure 1, is an SRAM-based
single-node implementation of the DIVA PIM chip architecture and is
currently in test. To minimize silicon area of this SRAM-based
prototype, we used a 1-Mbyte memory macro. For comparison, a
DRAM-based implementation with a [illegible]-Mbyte macro could be
fabricated in approximately half the area of the SRAM-based
prototype. The current prototype chip implements all features of the
DIVA PIM architecture except address translation and floating-point
capabilities. A second version of a PIM chip, which not only
integrates these functions but achieves a faster clock rate, is due
to tape out in the second half of [year illegible].

The current chip was fabricated through MOSIS in TSMC 0.18µm
technology, and the silicon die measures 9.8mm on a side. It contains
approximately 2 million logic transistors in addition to the 53
million transistors that implement 8 Mbits of SRAM. The chip also
contains 352 pads, 240 signal I/O, and is packaged in a 35mm BGA.
Much of the logic was synthesized with Synopsys Design Analyzer, and
the entire chip was placed and routed with Cadence Silicon Ensemble.
The IP building blocks used in the chip include Artisan standard
cells and register files, Virage Logic SRAM, and a NurLogic PLL clock
multiplier.

The chip is currently being tested for functionality with the use of
pattern generators, which apply test vectors to input pins, and logic
analyzer modules, which sense the outputs. Although exhaustive
testing has not yet been completed, the chip is correctly executing
at [illegible]MHz on the Cornerturn matrix transpose kernel described
in Section 4, exercising all major control and datapaths within the
PIM processing logic, including the WideWord permutation unit. Even
in this limited test setup, the chip performs 1.3[illegible] GOPS
while dissipating only 800mW. In addition to the processing logic
functionality, correct operation of parcels transiting through the
PiRC has also been verified.

We will soon begin integrating PIM-based DIMMs into a
workstation-class development system, incorporating compiler and
system software technology. For the host operating system, we have
augmented Linux to include PIM-specific support, such as loading PIM
code and data, booting, process management, and the memory management
functions outlined in [illegible]. We have also developed components
of the PIM run-time kernel, an augmented version of the RTEMS
open-source real-time embedded operating system. We have developed a
prototype compiler for the DIVA PIMs, which takes as input sequential
Fortran or C code, and produces DIVA executables that exploit both
the scalar and WideWord unit. We leverage the SUIF compiler,
including extensions described in [illegible] and our own
implementation of transformations described in [illegible], and a GCC
backend for the PowerPC AltiVec. A system-level compiler is an area
of future work.

6.  RELATED WORK

The DIVA system architecture is focused on achieving the following
four goals: (1) developing PIMs that can serve as the only memory in
the system, assuming the dual roles of "smart memories" and
conventional memory; (2) supporting a wide range of familiar
programming paradigms, closely related to parallel computing; (3)
targeting applications that are severely impacted by the
processor-memory bottlenecks in conventional systems: sparse-matrix
and pointer-based applications with irregular memory access patterns,
and image and video applications with large working sets; and, (4)
developing a VLSI device to exploit memory and communications
bandwidth in PIM-based systems while making efficient use of on-chip
resources for target applications.

These four goals distinguish DIVA from other PIM-based architectures.
Integration into a conventional system affords the simultaneous
benefits of PIM technology and a state-of-the-art host, yielding high
performance for mixed workloads. Since PIM processors are usually
less sophisticated due to on-chip space constraints, systems using
PIMs alone in a multiprocessor may sacrifice performance on
uniprocessor computations [12, 16, 25, 27], while system-on-a-chip
solutions (e.g., the IRAM [22] and the Mitsubishi M32R/D [20]) limit
the application domain. DIVA's support for a broad range of familiar
parallel programming paradigms, including task parallelism for
irregular computations, distinguishes it from systems with restricted
applicability (such as to SIMD parallelism [illegible]), as well as
those requiring a novel programming methodology or compiler
technology to configure logic [1], or to manage a complex memory,
computation and communication hierarchy [15]. DIVA's PIM-to-PIM
interconnect improves upon approaches that serialize communication
through the host, which decreases bandwidth by adding traffic to the
processor-memory bus [18, 21].

With respect to DIVA's WideWord unit, Table 2 compares the features
described in Section 3.2 with two commercial multimedia extensions
that support superword parallelism, PowerPC AltiVec and Intel SSE2,
as well as a previous research design called ASAP [illegible]. (Most
other multimedia extensions support subword parallelism, which
performs parallel operations on subfields of a machine word.) The
ASAP combines WideWord and scalar capabilities in a single unit. This
approach eliminates the need for transfers between register files,
but with register forwarding, it can complicate the pipeline and slow
down the clock rate. All other implementations have separate scalar
and WideWord units and register files, and other than DIVA, only SSE2
includes transfers between register files. The absence of such
capability was reported to be a performance bottleneck in the AltiVec
[illegible]. AltiVec and ASAP support only general permutations,
where permutation vectors are read from memory or constructed by
instructions. Both SSE2 and DIVA can avoid these costs of deriving a
permutation vector through hardwired permutation operations. In the
case of SSE2, permutation operations can only be expressed through
immediates, so the permutation must be known at compile time. DIVA's
hardwired permutation, which is in addition to general permutation,
is indirect because it references a scalar register. Hardwired
indirect permutations are more powerful than immediate permutations,
in that we can use nearby permutations for different iterations of a
loop without requiring unrolling (e.g., to do alignment). DIVA
provides a detailed reference design and implementation of selective
execution, related to the concept discussed in [illegible], that
supports selective execution in almost every wide instruction. By
comparison, since the AltiVec does not incorporate selective
execution of arithmetic operations, to accomplish the same result as
in Figure 8 on the AltiVec would require an additional instruction to
commit only those fields of the result of the add for which the
condition code is set.

We further consider a performance comparison with the PowerPC AltiVec
74XX. Even with a very aggressive DRAM technology, the 74XX can
achieve a peak main memory bandwidth which is only one third that of
the PIM DRAM. While the 74XX has better bandwidth for problems which
fit into the 256KB on-chip L2 cache, for our benchmarks with high
memory stall times, a single DIVA PIM processor will outperform the
AltiVec despite a much smaller transistor count on a DIVA PIM.
Further, since each DIVA system will include many interconnected PIM
chips, the performance advantage will scale with increasing memory
size for problems amenable to coarse-grain parallel computation.

7.  CONCLUSION

This paper has presented a detailed description of the DIVA PIM
microarchitecture. We discuss some of the issues that must be
considered in future architectures for exploiting memory bandwidth,
particularly the memory interface and controller, instruction set
features for fine-grain parallel operations, and mechanisms for
address translation. We present simulation results on eight programs,
demonstrating an average speedup of 3.3X as compared to a
conventional host. The speedups are due to up to 90% reduction in
memory stall time, and, for four of the programs, an average speedup
of 9.[illegible]3X due to fine-grain parallelism in the WideWord unit
as compared to scalar PIM execution. As a result of these effects,
six of the programs show fairly significant speedups over host-only
execution with just one PIM, even though the PIM processor is an
in-order, single-issue processor running at half the speed of the
host, which is an out-of-order 4-issue processor. These 1-PIM
speedups suggest DIVA's potential to outperform conventional
multiprocessors for certain applications, and at a much reduced
hardware cost.


Implementation of a 32-bit RISC Processor for the
Data-Intensive Architecture Processing-In-Memory Chip


Jeffrey Draper, Jeff Sondeen, Sumit Mediratta, Ihn Kim
University of Southern California Information Sciences Institute
draper@isi.edu, sondeen@isi.edu, sumitm@isi.edu, ihnkim@isi.edu


Abstract 

The Data-Intensive Architecture (DIVA) system employs
Processing-In-Memory (PIM) chips as smart-memory coprocessors to a
microprocessor. This architecture exploits inherent memory bandwidth
both on chip and across the system to target several classes of
bandwidth-limited applications, including multimedia applications and
pointer-based and sparse-matrix computations. The DIVA project is
building a prototype workstation-class system using PIM chips in
place of standard DRAMs to demonstrate these concepts. We have
recently completed initial testing of the first version of the
prototype PIM device.

A key component of this architecture is the scalar processor that
coordinates all activity within a PIM node. Since such a component is
present in each PIM node, we exploit parallelism to achieve
significant speedups rather than relying on costly, high-performance
processor design. The resulting scalar processor is then an in-order
32-bit RISC microcontroller that is extremely area-efficient. This
paper details the design and implementation of this scalar processor
in TSMC 0.18µm technology. In conjunction with other publications,
this paper demonstrates that impressive gains can be achieved with
very little "smart" logic added to memory devices.


1  Introduction 


The increasing gap between processor and memory speeds is a problem
in computer architecture, with peak processor performance increasing
at a rate of 50-60% per year while memory access times improve at
merely 5-7%. Furthermore, techniques designed to hide memory latency,
such as multithreading and prefetching, actually increase the memory
bandwidth requirements [2]. A recent VLSI technology trend, embedded
DRAM, offers a promising solution to bridging the processor-memory
gap [9]. One application of this technology integrates logic with
memory in a processing-in-memory (PIM) chip. Because PIM internal
processors can be directly connected to the memory banks, the memory
bandwidth is dramatically increased (with hundreds of gigabit/second
aggregate bandwidth available on a chip, up to 2 orders of magnitude
over conventional DRAM systems). Latency to on-chip logic is also
reduced, down to as little as one half that of a conventional memory
system, because internal memory accesses avoid the delays associated
with communicating off chip.
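
The growth rates quoted above compound quickly, and a short
calculation makes that concrete. The sketch below simply treats both
figures as annual improvement rates and compounds their ratio; this
is a simplification for illustration, and the function name and the
10-year horizon are assumptions, not figures from this paper.

```python
def relative_gap(years, processor_rate, memory_rate):
    """Factor by which the processor/memory speed ratio grows.

    Rates are annual fractional improvements, e.g. 0.5 for 50%.
    """
    return ((1.0 + processor_rate) / (1.0 + memory_rate)) ** years


# Most conservative pairing of the quoted ranges: 50% against 7%.
conservative = relative_gap(10, 0.50, 0.07)
# Least conservative pairing: 60% against 5%.
aggressive = relative_gap(10, 0.60, 0.05)
```

Even with the conservative pairing the ratio grows by well over an
order of magnitude in a decade, which is the motivation for moving
computation next to the memory banks.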

The Data-Intensive Architecture (DIVA) project leverages PIM
technology to replace or augment the memory system of a conventional
workstation with "smart memories" capable of very large amounts of
processing. System bandwidth limitations are thus overcome in three
ways: (1) tight coupling of a single PIM processor with an on-chip
memory bank; (2) distributing multiple processor-memory nodes per PIM
chip; and, (3) utilizing a separate chip-to-chip interconnect that
allows PIM chips to communicate without interfering with host memory
bus traffic. Although suitable as a general-purpose computing
platform, DIVA specifically targets two important classes of
applications that are severely performance limited by the
processor-memory bottlenecks in conventional systems: multimedia
processing and applications with irregular data accesses. Multimedia
applications tend to have little temporal reuse [12] but often
exhibit spatial locality and both fine-grain and coarse-grain
parallelism. DIVA PIMs exploit spatial locality and fine-grain
parallelism by accessing and operating upon multiple words of data at
a time and exploit coarse-grain parallelism by spreading independent
computations across PIM nodes. Applications with irregular data
accesses, such as sparse-matrix and pointer-based computations,
perform poorly on conventional architectures because they tend to
lack spatial locality and thus make poor use of caches. As a result,
their execution is dominated by memory stalls [3]. DIVA accelerates
such applications by eliminating much of the traffic between a host
processor and memory; simple operations and dereferencing can be done
mostly within PIM memories.

Performance evaluation of many applications has shown that a DIVA platform provides
significant speedups. These results as well as thorough descriptions of system architecture
issues have appeared in previous papers [5, 6, 7]. Also included in previous publications are
comparisons to other PIM architectures as well as conventional architectures. This paper
focuses on the microarchitecture design and implementation of the scalar processor, or
microcontroller, that coordinates all activity on a DIVA PIM node. Due to area constraints,
the design goal was a relatively simple processor with a coherent, well-designed instruction
set, for which a gcc-like compiler is being adapted. The resulting scalar processor is a
RISC processor that supports single-issue, in-order execution, with 32-bit instructions and
32-bit addresses. Its novelty lies in the special-purpose functions it supports to interface
to other crucial components of the DIVA design. The processor was fabricated as part
of a DIVA prototype chip in TSMC 0.18µm technology and is currently in test. The
remainder of the paper is organized as follows. Sections 2 and 3 present an overview of
the DIVA system architecture and microarchitecture, to put the scalar processor design
into its proper context. Section 4 describes the scalar processor microarchitecture in detail.
Section 5 presents details of the fabrication and testing of the scalar processor as part of a
PIM chip, and Section 6 concludes the paper.


2  System architecture overview


A driving principle of the DIVA system architecture is efficient use of PIM technology
while requiring a smooth migration path for software. This principle demands integration
of PIM features into conventional systems as seamlessly as possible. As a result, DIVA
chips are designed to resemble commercial DRAMs, enabling PIM memory to be accessed
by host software as if it were conventional memory. In Figure 1, we show a small set of
PIMs connected to a single host processor through conventional memory control logic.

Figure 1.  DIVA system architecture

Spawning computation, gathering results, synchronizing activity, or simply accessing
non-local data is accomplished via parcels. A parcel is closely related to an active
message as it is a relatively lightweight communication mechanism containing a reference to
a function to be invoked when the parcel is received [1-4]. Parcels are distinguished from
active messages in that the destination of a parcel is an object in memory, not a specific
processor. From a programmer's view, parcels, together with the global address space
supported in DIVA, provide a compromise between the ease of programming a shared-memory
system and the architectural simplicity of pure message passing. Parcels are transmitted
through a separate PIM-to-PIM interconnect to enable communication without interfering
with host-memory traffic, as shown in Figure 1. Details of this interconnect may be found
in [11], and more details of the system architecture may be found in [5, 6, 7].
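
The parcel mechanism described above can be illustrated with a minimal Python sketch. The names Parcel, Node, and route, the contiguous address-range ownership rule, and the handler table are assumptions made for illustration only; the text does not define the parcel format or the routing logic at this level of detail. The point of the sketch is that a parcel names a destination object by global address, not a destination processor.

```python
from dataclasses import dataclass, field

@dataclass
class Parcel:
    dest_addr: int      # global address of the target object (not a processor id)
    func: str           # reference to the function to invoke on receipt
    args: tuple = ()

@dataclass
class Node:
    base: int           # first global address owned by this node (assumed contiguous)
    size: int
    memory: dict = field(default_factory=dict)

    def owns(self, addr):
        return self.base <= addr < self.base + self.size

    def receive(self, parcel, handlers):
        # Invoke the referenced function on the object at the destination address.
        return handlers[parcel.func](self.memory, parcel.dest_addr, *parcel.args)

def route(parcel, nodes, handlers):
    """Deliver a parcel to whichever node holds the destination object."""
    for node in nodes:
        if node.owns(parcel.dest_addr):
            return node.receive(parcel, handlers)
    raise LookupError("no node owns address %#x" % parcel.dest_addr)

def increment(memory, addr, amount):
    # Hypothetical handler: add `amount` to the word stored at `addr`.
    memory[addr] = memory.get(addr, 0) + amount
    return memory[addr]

nodes = [Node(base=0x0000, size=0x1000), Node(base=0x1000, size=0x1000)]
handlers = {"increment": increment}
result = route(Parcel(dest_addr=0x1004, func="increment", args=(5,)), nodes, handlers)
```

Here the sender never names the second node; the address alone determines where the function runs.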


3  Microarchitecture overview


Each DIVA PIM chip is a VLSI memory device augmented with general-purpose computing
and networking/communication hardware. Although a PIM may consist of multiple
nodes, each of which is primarily comprised of a few megabytes of memory and a node
processor, Figure 2a shows a PIM with a single node, which reflects the focus of the initial
research that is being conducted. Nodes on a PIM chip share a single PIM Routing
Component (PiRC) and a host interface. The PiRC is responsible for routing parcels between
on-chip parcel buffers and neighboring off-chip PiRCs. The host interface implements the
JEDEC standard SDRAM protocol [10] so that memory accesses as well as parcel activity
initiated by the host appear as conventional memory accesses from the host perspective.

Figure 2.  DIVA PIM chip: a) chip organization; b) microphotograph of die

Figure 2a also shows two interconnects that span a PIM chip for information flow
between nodes, the host interface, and the PiRC. Each interconnect is distinguished by the
type of information it carries. The PIM memory bus is used for conventional memory
accesses from the host processor. The parcel interconnect allows parcels to transit between
the host interface, the nodes, and the PiRC. Within the host interface, a parcel buffer
(PBUF) is a buffer that is memory-mapped into the host processor's address space,
permitting application-level communication through parcels. Each PIM node also has a PBUF,
memory-mapped into the node's local address space.

Figure 3 shows the major control and data connections within a node. The DIVA PIM
node processing logic supports single-issue, in-order execution, with 32-bit instructions
and 32-bit addresses. There are two datapaths whose actions are coordinated by a single
execution control unit: a 32-bit scalar datapath that performs operations similar to those
of standard 32-bit integer units, and a 256-bit WideWord datapath that performs
fine-grain parallel operations on 16- or 32-bit operands. Both datapaths execute from a
single instruction stream under the control of a single 5-stage DLX-like pipeline [8]. The
instruction set has been designed so both datapaths can, for the most part, use the same
opcodes and condition codes, generating a large functional overlap. Each datapath has its
own independent general-purpose register file, 32 32-bit registers for the scalar datapath
and 32 256-bit registers for the WideWord datapath, but special instructions permit direct
transfer between datapaths without going through memory. Although not supported in
the initial DIVA prototype, floating-point extensions to the WideWord unit will be provided
in future systems. In addition to the execution unit and associated datapaths, each DIVA
PIM node contains other essential components of note. Descriptions of these components
as well as the WideWord datapath will appear in future publications.

Figure 3.  DIVA PIM node architecture
[Block diagram: memory port, memory bus, arbiter, instruction cache, pipeline execution
control unit, scalar datapath (register file, ALU, etc.), WideWord datapath (register file,
ALU, etc.), and parcel buffer (PBUF), joined by address/control and data connections.]

4  Microarchitecture details of the DIVA scalar processor

The combination of the execution control unit and scalar datapath is for the most part
a standard RISC processor and serves as the DIVA scalar processor, or microcontroller. It
coordinates all activity within a DIVA PIM node, including SIMD-like operations in the
WideWord datapath, interactions between the scalar and WideWord datapaths, and parcel
communication. To avoid synchronization overhead and compiler issues associated with
coprocessor designs and also design complexity associated with superscalar interlocks, the
DIVA scalar processor was designed to be tightly integrated with other subcomponents, as
described in the previous section. This characteristic led to a custom design rather than
augmenting an off-the-shelf embedded IP core. This section describes the microarchitecture
of the DIVA scalar processor by first presenting an overview of the instruction set
architecture, followed by a description of the pipeline and discussion of special features.

4.1  Instruction set architecture overview


Much like the DLX architecture [8], most DIVA scalar instructions use a three-operand
format to specify two source registers and a destination register, as shown in Figure 4.
For these types of instructions, the opcode generally denotes a class of operations, such as
arithmetic, and the function denotes a specific operation, such as add. The C bit indicates
whether the operation performed by the instruction execution updates condition codes.
In lieu of a second source register, a 16-bit immediate value may be specified. The scalar
instruction set includes the typical arithmetic functions add, subtract, multiply, and divide;
logical functions AND, OR, NOT, and XOR; and logical/arithmetic shift operations. In
addition, there are a number of special instructions, described in Section 4.3. Load/store
instructions adhere to the immediate format, where the address for the memory operation
is formed by the addition of an immediate value to the contents of rA, which serves as a
base address. The DIVA scalar processor does not support a base-plus-register addressing
mode because it requires an extra read port on the register file for store operations.
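
As a small illustration of the immediate-format addressing described above, the following Python sketch forms a load/store address from a base register value and a 16-bit immediate. Sign extension of the immediate and wraparound at 32 bits are assumptions; the text states only that the immediate is added to the contents of rA.

```python
MASK32 = 0xFFFFFFFF

def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a two's-complement value (assumed)."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm

def effective_address(ra_value, imm16):
    """Base-plus-immediate address, wrapped to 32 bits."""
    return (ra_value + sign_extend16(imm16)) & MASK32
```

Because the second operand is always an immediate, a store needs only two register reads (the base and the data to be written), which matches the two-read-port register file described later in the text.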
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Figure  4.  DIVA  scalar  arithmetic/logical  instruction  formats 

Branch instructions use a different format (not shown due to page constraints). The
branch target address may be PC-relative, useful for relocatable code, or calculated using
a base register combined with an offset, useful with table-based branch targets. In both
formats, the offset is in units of instruction words, or 4 bytes. By specifying the offset in
instruction words, rather than bytes, a larger branch window results. To support function
calls, the branch instruction format also includes a bit for specifying linkage, that is, whether
a return instruction address should be saved in R31. The branch format also includes a
3-bit condition field to specify one of eight branch conditions: always, equal, not equal, less
than, less than or equal, greater than, greater than or equal, or overflow.
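
The benefit of word-granular offsets can be shown with a short Python sketch. The offset field width (offset_bits) is a hypothetical parameter, since the branch format is not shown in the text, and the choice of the branch's own address as the base for PC-relative targets is likewise an assumption.

```python
INSTR_BYTES = 4  # each instruction is one 32-bit word

def pc_relative_target(pc, offset_words):
    """Target of a PC-relative branch whose offset counts instruction words."""
    return (pc + offset_words * INSTR_BYTES) & 0xFFFFFFFF

def branch_window_bytes(offset_bits, unit_bytes):
    """Total span in bytes reachable by a signed offset of the given width."""
    return (1 << offset_bits) * unit_bytes
```

For any fixed field width, counting in words rather than bytes makes the reachable window four times larger, because the two low-order address bits of an aligned instruction are always zero and need not be encoded.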

4.2  Pipeline description and associated hazards

A more detailed depiction of the pipeline execution control unit and scalar datapath is
given in Figure 5. The pipeline is a standard DLX-like 5-stage pipeline [8], with the
following stages: (1) instruction fetch; (2) decode and register read; (3) execute; (4) memory;
and (5) writeback. The pipeline controller contains the necessary logic to handle data,
control, and structural hazards. Data hazards occur when there are read-after-write register
dependences between instructions that co-exist in the pipeline. The controller and datapath
contain the necessary forwarding, or bypass, logic to allow pipeline execution to proceed
without stalling in most data dependence cases. The only exception to this generality
involves the load instruction, where a "bubble" is inserted between the load instruction and
an immediately following instruction that uses the load target register as one of its source
operands. This hazard is handled with hardware interlocks, rather than being exposed to
the compiler, to be compatible with a previously developed compiler.
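
The load-use interlock rule above can be captured in a few lines of Python. The tuple encoding of instructions and the register names are hypothetical; the sketch assumes, as the text states, that forwarding removes every read-after-write stall except the one directly after a load.

```python
def count_load_use_bubbles(program):
    """Count bubbles: one per instruction that reads the target of the load just before it.

    Each instruction is (opcode, dest, sources). Only a load is assumed to stall
    its immediate consumer; all other RAW dependences are forwarded.
    """
    bubbles = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_srcs = curr
        if prev_op == "load" and prev_dest in curr_srcs:
            bubbles += 1
    return bubbles

program = [
    ("load", "r1", ("r2",)),
    ("add",  "r3", ("r1", "r4")),   # uses r1 right after the load: one bubble
    ("add",  "r5", ("r3", "r3")),   # ALU result is forwarded: no bubble
    ("load", "r6", ("r2",)),
    ("sub",  "r7", ("r4", "r5")),   # independent of r6: no bubble
    ("add",  "r8", ("r6", "r7")),   # r6 was loaded two instructions earlier: no bubble
]
```

A compiler that schedules an independent instruction after each load, as in the second half of the example, avoids the bubble without any change to the hardware.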


Figure  5.  DIVA  scalar  processor  pipeline  description 

Control hazards occur for branch instructions. Unlike the DLX architecture [8], which
uses explicit comparison instructions and testing of a general-purpose register value for
branching decisions, the DIVA design incorporates condition codes that may be updated
by most instructions. Although a slightly more complex design, this scheme obviates the
need for several comparison instructions in the instruction set and also requires one fewer
instruction execution in every comparison/branch sequence. The condition codes used for
branching decisions are: EQ - set if the result is zero, LT - set if the result is negative,
GT - set if the result is positive, and OV - set if the operation overflows. Unlike the load
data dependence hazard, which is not exposed to the compiler, the DIVA pipeline design
imposes a 1-delay slot branch, so that the instruction following a branch instruction is
always executed. Since branches are resolved within the second stage of the pipeline,
no stalls occur with branch instructions. The delayed branch was selected because it was
compatible with a previously developed compiler.
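
A minimal Python sketch of how the four condition codes might be derived from a 32-bit addition follows. Treating OV as signed overflow, and deriving LT and GT from the wrapped 32-bit result, are assumptions; the text defines the codes only informally.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(x):
    """Reinterpret a 32-bit pattern as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def add_with_flags(a, b):
    """32-bit add returning (result, flags) with EQ/LT/GT/OV."""
    result = (a + b) & MASK32
    signed = to_signed32(result)
    true_sum = to_signed32(a) + to_signed32(b)
    flags = {
        "EQ": signed == 0,
        "LT": signed < 0,
        "GT": signed > 0,
        "OV": true_sum != signed,   # signed overflow (assumed interpretation)
    }
    return result, flags
```

Because an arithmetic instruction with its C bit set already leaves these codes behind, a following conditional branch can test them directly, which is where the saving of one instruction per comparison/branch sequence comes from.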

Since the general-purpose register file contains 2 read ports and 1 write port, it may
sustain two operand reads and 1 result write every clock cycle; thus, the register file design
introduces no structural hazards. The only structural hazard that impacts the pipeline
operation is the node memory. Pipeline stalls occur when there is an instruction cache miss.
The pipeline will resume once the cache fill memory request has been satisfied. Likewise,
since there is no data cache, stalls occur any time a load/store instruction reaches the
memory stage of the pipeline until the memory operation is completed.

4.3  Special  features 

The novelty of the DIVA scalar processor lies in the special features that support
DIVA-specific functions. Although by no means exhaustive, this section highlights some of the
more notable capabilities.

4.3.1  Run-time kernel support

The execution control unit supports supervisor and user modes of processing and also
maintains a number of special-purpose and protected registers for support of exception
handling, address translation, and general OS services. Exceptions, arising from execution
of node instructions, and interrupts, from other sources such as an internal timer or an
external component like the PBUF, are handled by a common mechanism. The exception
handling scheme for DIVA has a modest hardware requirement, exporting much of the
complexity to software, to maintain a flexible implementation platform. It provides an
integrated mechanism for handling hardware and software exception sources and a flexible
priority assignment scheme that minimizes the amount of time that exception recognition
is disabled. While the hardware design allows traditional stack-based exception handlers,
it also supports a non-recursive dispatching scheme that uses DIVA hardware features to
allow preemption of lower-priority exception handlers.

The impact of run-time kernel support on the scalar processor design is the addition
of a modest number of special-purpose and protected (or supervisor-level) registers and
a non-negligible amount of complexity added to the pipeline control for entering/exiting
exception handling modes cleanly. When an exception is detected by the scalar processor
control unit, the logic performs a number of tasks within a single clock cycle to prepare the
processor for entering an exception handler in the next clock cycle. Those tasks include:

•  determining which exception to handle by prioritizing among simultaneously
occurring exceptions,

•  setting up shadow registers to capture critical state information, such as the processor
status word register, the instruction address of the faulting instruction, the memory
address if the exception is an address fault, etc.,

•  configuring the program counter logic to load an exception handler address on the
next clock cycle, and

•  setting up the processor status word register to enter supervisor mode with exception
handling temporarily disabled.
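
The four tasks listed above can be modeled with a short Python sketch. The exception names, the priority order, the handler addresses, and the layout of the shadow state are all hypothetical; the text does not specify them. The sketch shows only the sequence of bookkeeping steps, together with the matching restore from the shadow registers.

```python
from dataclasses import dataclass, field

@dataclass
class CpuState:
    pc: int = 0
    supervisor: bool = False
    exceptions_enabled: bool = True
    shadow: dict = field(default_factory=dict)

def enter_exception(state, pending, priority, vectors, fault_addr=None):
    """Model the single-cycle exception entry tasks.

    pending:  set of exception names raised in this cycle
    priority: list of exception names, highest priority first (hypothetical order)
    vectors:  mapping from exception name to handler address (hypothetical)
    """
    # 1. Prioritize among simultaneously occurring exceptions.
    chosen = next(name for name in priority if name in pending)
    # 2. Capture critical state in shadow registers.
    state.shadow = {
        "psw": (state.supervisor, state.exceptions_enabled),
        "fault_pc": state.pc,
        "fault_addr": fault_addr,
    }
    # 3. Point the program counter at the handler.
    state.pc = vectors[chosen]
    # 4. Enter supervisor mode with exception recognition temporarily disabled.
    state.supervisor = True
    state.exceptions_enabled = False
    return chosen

def return_from_exception(state):
    """Copy the shadow contents back to resume at the interrupted point."""
    state.supervisor, state.exceptions_enabled = state.shadow["psw"]
    state.pc = state.shadow["fault_pc"]
```

Keeping the saved state in dedicated shadow registers rather than on a memory stack is what allows all four steps to complete without a memory access.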




Once invoked, the exception handler first stores other pieces of user state and interrogates
various pieces of state hardware to determine how to proceed. Once the exception handler
routine has completed, it restores user state and then executes a return-from-exception
instruction, which copies the shadow register contents back into various state registers to
resume processing at the point before the exception was encountered. If it is impossible to
resume previous processing due to a fatal exception, the run-time kernel exception handler
may choose to terminate the offending process.

4.3.2  Interaction with the WideWord datapath

There are a number of features in the scalar processor design involving communication
with the WideWord datapath that greatly enhance performance. The path to/from the
WideWord datapath in the execute stage of the pipeline, shown in Figure 5, facilitates
the exchange of data between the scalar and WideWord datapaths without going through
memory. This capability distinguishes DIVA from other architectures containing vector
units, such as AltiVec [1]. This path also allows scalar register values to be used as specifiers
for WideWord functions, such as indices for selecting subfields within WideWords and
indices into permutation look-up tables [4]. Instead of requiring an immediate value within
a WideWord instruction for specifying such indices, this register-based indexing capability
enables more intelligent, efficient code design.
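
Register-based subfield selection can be sketched in Python by treating a WideWord as a 256-bit integer. The function names and the choice of subfield 0 as the least-significant field are assumptions; the text does not give the hardware's numbering convention.

```python
WIDEWORD_BITS = 256

def select_subfield(wideword, index, width=32):
    """Return subfield number `index` of a 256-bit value.

    Subfield 0 is taken to be the least-significant field (assumed convention).
    """
    lanes = WIDEWORD_BITS // width
    if not 0 <= index < lanes:
        raise IndexError("subfield index out of range")
    return (wideword >> (index * width)) & ((1 << width) - 1)

def pack(fields, width=32):
    """Build a WideWord from a list of subfield values, subfield 0 first."""
    word = 0
    for i, value in enumerate(fields):
        word |= (value & ((1 << width) - 1)) << (i * width)
    return word
```

Because index is an ordinary run-time value here, a loop can step through subfields with one instruction sequence; with an immediate-only encoding, each index would need its own instruction.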

There are also a couple of instructions that are especially useful for enabling efficient data
mining operations. ELO, encode leftmost one, and CLO, clear leftmost one, are instructions
that generate a 5-bit index corresponding to the bit position of the leftmost one in a
32-bit value and clear the leftmost one in a 32-bit value, respectively. These instructions
are especially useful for examining the 32-bit WideWord condition code register values,
which may be transferred to scalar general-purpose registers to perform such tests. For
instance, with this capability, finding and processing data items that match a specified key
is accomplished in far fewer instructions than the sequence of bit masking and shifting
involved in 32 bit tests, which is required with conventional processor architectures.
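
The behavior of these two instructions, and the loop they enable, can be sketched in Python. The bit numbering (0 for the least-significant bit, 31 for the most-significant) and the result for a zero input are assumptions, since the text does not define them.

```python
def elo(value):
    """Encode leftmost one: position of the most-significant set bit.

    Returns None for a zero input (assumed; the hardware result is not given).
    """
    value &= 0xFFFFFFFF
    if value == 0:
        return None
    return value.bit_length() - 1

def clo(value):
    """Clear leftmost one: the value with its most-significant set bit cleared."""
    value &= 0xFFFFFFFF
    if value == 0:
        return 0
    return value & ~(1 << (value.bit_length() - 1))

def matching_positions(cond_codes):
    """Visit each set bit of a 32-bit condition-code value, leftmost first."""
    positions = []
    while cond_codes:
        positions.append(elo(cond_codes))
        cond_codes = clo(cond_codes)
    return positions
```

The loop runs once per set bit rather than once per bit position, which is the saving over a mask-and-shift scan of all 32 positions when only a few subfields match.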

There are some variations of the branch/call instructions that also interact with the
WideWord datapath. The BA (branch on all) instruction specifies that a branch is to
be taken if the status of condition codes within every subfield of the WideWord datapath
matches the condition specified in the BA instruction. The BN (branch on none)
instruction specifies that a branch is to be taken if the status of condition codes within no subfield
of the WideWord datapath matches the condition specified in the BN instruction. With
proper code structuring around these instructions, inverse forms of these branches, such as
branch on any or branch on not all, can also be effected.
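
The BA and BN semantics, and the way their inverses follow from them, can be sketched in Python. Representing each subfield's condition codes as a set of code names is an illustrative choice, not the hardware encoding.

```python
def branch_on_all(subfield_flags, cond):
    """BA: taken when every subfield's condition codes match cond."""
    return all(cond in flags for flags in subfield_flags)

def branch_on_none(subfield_flags, cond):
    """BN: taken when no subfield's condition codes match cond."""
    return not any(cond in flags for flags in subfield_flags)

def branch_on_any(subfield_flags, cond):
    """Inverse of BN, obtained by swapping the taken and fall-through paths."""
    return not branch_on_none(subfield_flags, cond)

def branch_on_not_all(subfield_flags, cond):
    """Inverse of BA, obtained the same way."""
    return not branch_on_all(subfield_flags, cond)
```

In code, "branch on any" is therefore a BN whose fall-through path holds the work that should run when at least one subfield matches.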

4.3.3  Miscellaneous instructions

There are also several other miscellaneous instructions that add some complexity to the
processor design. The probe instruction allows a user to interrogate the address translation
logic to see if a global address is locally mapped. This capability allows users who
wish to optimize code for performance to avoid slow, overhead-laden address translation
exceptions. Also, an instruction cache invalidate instruction allows the supervisor kernel to
evict user code from the cache without invalidating the entire cache and is useful in process
termination cleanup procedures. Lastly, there are versions of load/store instructions that
"lock" memory operations, which are useful for implementing synchronization functions,
such as semaphores or barriers.

5  Implementation and testing of the DIVA scalar processor

The specification of the DIVA scalar processor required on the order of 10,000 lines of
VHDL code, consisting of a mix of RTL-level behavioral and gate-level structural code. A
preliminary, unoptimized stand-alone layout of the scalar processor consisted of 23,000
standard cells (approximately 200,000 transistors) and occupied 1 sq mm in 0.18µm technology.
It was projected to operate at 400MHz while dissipating 80mW.

Although the scalar processor is suitable for stand-alone embedded implementations,
the DIVA project employs it as part of a tightly integrated node design, as discussed in
Section 3. The scalar processor VHDL specification was included as part of the DIVA
PIM prototype specification, which was synthesized as a "sea of gates" using Synopsys
Design Analyzer. The entire chip was placed and routed with Cadence Silicon Ensemble,
and physical verification, such as DRC and LVS, was performed with Mentor Calibre.
The intellectual property building blocks used in the chip include Virage Logic SRAM, a
NurLogic PLL clock multiplier, and Artisan standard cells, pads, and register files.

The first DIVA PIM prototype, shown in Figure 2b, is a single-node implementation
of the DIVA PIM chip architecture and is currently in test. Due to challenges in gaining
access to embedded DRAM fabrication lines in a timely fashion, this first prototype is
SRAM-based. This chip implements all features of the DIVA PIM architecture except
address translation and floating-point capabilities. A second version of a PIM chip, which
not only integrates these functions but achieves a faster clock rate, is due to tape out in the
second half of 2002. The chip shown in Figure 2b was fabricated through MOSIS in TSMC
0.18µm technology, and the silicon die measures 9.8mm on a side. It contains approximately
2 million logic transistors in addition to the 53 million transistors that implement 8 Mbits
of SRAM. The chip also contains 352 pads, 240 signal I/O, and is packaged in a 35mm
TBGA. The chip is estimated to dissipate 2.5W at 100MHz.

The chip is being tested with the use of an HP 16702A logic analysis mainframe. Pattern
generator modules apply test vectors to the inputs of the chip, and timing/state capture
modules sense the outputs of the chip. The chip is currently being tested for functionality
at a testbench speed of 80MHz. Although exhaustive testing has not yet been completed,
the chip is running a demonstration application of matrix transpose that exercises all major
control and datapaths within the scalar processor, including many of the special features
highlighted in Section 4.3. Even in this limited test setup, the chip is performing 640 MOPS
while dissipating only 800mW. We estimate that the scalar processor is contributing only
80mW to this power measure. Also, though at-speed testing has not been completed yet,
we do not anticipate this prototype to operate much beyond 100MHz due to critical path
limitations in the WideWord datapath. If implemented and optimized separately as an
embedded microcontroller, we expect the scalar processor to easily operate above 500MHz.

6  Conclusion 

This paper has presented the design and implementation of the scalar PIM processor
used in the DIVA system, an integrated hardware and software architecture for exploiting
the bandwidth of PIM-based systems. Although the core of the scalar processor design is
much like a standard 32-bit RISC processor, it has a number of special features that make
it well-suited to serving as a PIM node microcontroller. A working implementation of this
architecture, based on TSMC 0.18µm technology, has proven the validity of the design. The
resulting workstation system architecture that incorporates PIMs using this processor is
projected to achieve speedups ranging from 8.8 to 88.3 over conventional workstations for a
number of applications [5, 6]. These results demonstrate that by sacrificing a small amount
of area for processing logic on memory chips, PIM-based systems are a viable method for
combating the memory wall problem.

Implementation of a 256-bit WideWord Processor for the Data-Intensive
Architecture (DIVA) Processing-In-Memory (PIM) Chip

USC Information Sciences Institute

Abstract

The Data-Intensive Architecture (DIVA) system
incorporates Processing-In-Memory (PIM) chips as
smart-memory coprocessors to a microprocessor. This
architecture exploits inherent memory bandwidth both
on chip and across the system to target several classes of
bandwidth-limited applications, including multimedia,
pointer-based, and sparse-matrix applications. The
DIVA project is building a prototype workstation-class
system using PIM chips in place of standard DRAMs to
demonstrate these concepts.

A key component of this architecture is the WideWord
Processor, which is a 5-stage pipelined 256-bit
datapath, complete with register file and ALU blocks.
This component offers fine-grained data parallelism
resulting in significant speedups. This paper details the
design and implementation of this WideWord Processor
in TSMC 0.18µm technology.

1.  Introduction

The increasing gap between processor and memory
speeds is a well-known problem in computer
architecture, with peak processor performance increasing
at a rate of 50-60% per year while memory access times
improve at merely 5-7%. Furthermore, techniques
designed to hide memory latency, such as multithreading
and prefetching, actually increase the memory bandwidth
requirements [3]. A recent VLSI technology trend,
embedded DRAM, offers a promising solution to
bridging the processor-memory gap |0|. One application
of this technology integrates logic with high-density
memory in a processing-in-memory (PIM) chip. Because
PIM internal processors can be directly connected to the
memory banks, the memory bandwidth is dramatically
increased (with hundreds of gigabits/second aggregate
bandwidth available on a chip, up to 2 orders of
magnitude over conventional DRAM). Latency to
on-chip logic is also reduced, down to as little as one half
that of a conventional memory system, because internal
memory accesses avoid the delays associated with
communicating off chip.
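
The growth-rate figures quoted above compound quickly, as the following Python sketch shows. Treating both rates as steady annual improvements in speed is a simplifying assumption made for illustration, and the function name is hypothetical.

```python
def relative_gap(years, cpu_rate, mem_rate):
    """Ratio of processor to memory improvement after compounding for `years`."""
    return ((1 + cpu_rate) / (1 + mem_rate)) ** years

# Narrowest and widest cases implied by the quoted ranges, over ten years.
low = relative_gap(10, 0.50, 0.07)
high = relative_gap(10, 0.60, 0.05)
```

Under these assumptions the gap widens by a factor of roughly 29 to 68 over a decade, which is why latency-hiding techniques alone do not close it.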

The Data-Intensive Architecture (DIVA) project uses
PIM technology to replace or augment the memory
system of a conventional workstation with "smart
memories" capable of very large amounts of processing.
System bandwidth limitations are thus overcome in three
ways: (1) tight coupling of a single PIM processor with
an on-chip memory bank; (2) distributing multiple
processor-memory "nodes" per PIM chip; and (3)
utilizing a separate chip-to-chip interconnect, for direct
communication between nodes on different chips that
bypasses the host system bus. The system architecture of
DIVA is focused on achieving the following four goals:
(1) developing PIMs that can serve as the only memory
in the system, assuming the dual roles of "smart
memories" and conventional memory; (2) supporting a
wide range of familiar programming paradigms, closely
related to parallel computing; (3) targeting applications
that are severely impacted by the processor-memory
bottlenecks in conventional systems: sparse-matrix and
pointer-based applications with irregular memory access
patterns, and image and video applications with large
working sets; and (4) developing a VLSI device to
exploit memory and communications bandwidth in
PIM-based systems while making efficient use of on-chip
resources for target applications.

This paper focuses on the microarchitecture design
and implementation of the WideWord Processor
component of the PIM processing logic. Similar in style
to vector extensions like AltiVec [1], the DIVA
WideWord Processor uses a 256-bit datapath that
enables significant processing speedups through the use
of data parallelism. The WideWord Processor was
fabricated as part of a DIVA prototype chip in TSMC
0.18µm technology and is currently in test. The
remainder of the paper is organized as follows. Sections
2 and 3 present an overview of the DIVA system
architecture and microarchitecture, to put the WideWord
Processor design into its proper context. Section 4
describes the WideWord microarchitecture in detail.
Section 5 presents details of the fabrication and testing of
the WideWord Processor as part of a PIM chip, and
Section 6 concludes the paper.

2. System architecture overview

A driving principle of the DIVA system architecture is efficient use of PIM technology while requiring a smooth migration path for software. This principle demands integration of PIM features into conventional systems as seamlessly as possible. As a result, DIVA PIM chips are designed to resemble commercial DRAMs, enabling PIM memory to be accessed by host software as if it were conventional memory. In Figure 1, we show a small set of PIMs connected to a single host processor through conventional memory control logic.


Figure 1. DIVA system architecture
Spawning computation, gathering results, synchronizing activity, or simply accessing non-local data is accomplished via parcels. A parcel is similar to an active message, as it is a relatively lightweight communication mechanism containing a reference to a function to be invoked when the parcel is received [12]. From a programmer's view, parcels, together with the global address space supported in DIVA, provide a compromise between the ease of programming a shared-memory system and the architectural simplicity of pure message passing. Parcels utilize a separate PIM-to-PIM interconnect to enable communication without interfering with host-memory traffic, as shown in Figure 1. Details of this interconnect can be found in [10], and more detail about the DIVA system architecture can be found in [2][4][6][7].


3. Microarchitecture overview

Each DIVA PIM chip is a VLSI memory device augmented with general-purpose computing and networking/communication hardware. Although a PIM may consist of multiple nodes, each of which is primarily comprised of a few megabytes of memory and a node processor, Figure 2 shows a PIM with a single node, which reflects the focus of the initial research that is being conducted. Nodes on a PIM chip share a single PIM Routing Component (PiRC) and a host interface. The PiRC is responsible for routing parcels on and off chip. The host interface implements the JEDEC standard SDRAM protocol so that memory accesses as well as parcel activity initiated by the host appear as conventional memory accesses from the host perspective.

Figure 2 also shows two interconnects that span a PIM chip for information flow between nodes, the host interface, and the PiRC. Each interconnect is distinguished by the type of information it carries. The PIM memory bus is used for conventional memory accesses from the host processor. The parcel interconnect allows parcels to transit between the host interface, the nodes, and the PiRC. Within the host interface, a parcel buffer (PBUF) is a buffer that is memory-mapped into the host processor's address space, permitting application-level communication through parcels. Each PIM node also has a PBUF, memory-mapped into the node's local address space.


Figure  2.  DIVA  PIM  chip  organization 
Figure 3 shows the major control and data connections within a node, with the 256-bit memory data bus as the centerpiece. The DIVA PIM node processing logic supports single-issue, in-order execution, with 32-bit instructions and 32-bit addresses. There are two datapaths whose actions are coordinated by a single execution control unit: a scalar datapath that performs sequential operations on 32-bit operands, and a WideWord datapath that performs fine-grain parallel operations on 256-bit operands. Both datapaths execute from a single instruction stream under the control of a single 5-stage DLX-like pipeline [8]. The instruction set has been designed so both datapaths can, for the most part, use the same opcodes and condition codes, generating a large functional overlap.


Figure  3.  DIVA  PIM  node  architecture 
Each datapath has its own independent general-purpose register file, 32-bit registers for the scalar datapath and 32 256-bit registers for the WideWord datapath, but special instructions permit direct transfers between datapaths without going through memory. Although not supported in the initial DIVA prototype, floating-point extensions to the WideWord datapath will be provided in future implementations. In addition to the execution units, each DIVA PIM node contains other essential components of note. These components are described in [5].

4. Microarchitecture details of the DIVA WideWord Processor

The combination of the execution control unit and WideWord datapath is regarded as the WideWord Processor. This component enables superword-level parallelism [11] on wide words of 256 bits, similar to multimedia extensions such as MMX and AltiVec. This fine-grain parallelism offers additional opportunity for exploiting the increased processor-memory bandwidth available in a PIM. Selective execution, direct transfers to/from other register files, integration with communication, as well as the ability to access main memory at very low latency, distinguish the DIVA WideWord capabilities from MMX and AltiVec. This section details the microarchitecture of this component by first presenting an overview of the instruction set architecture, followed by a description of the pipeline.


4.1. Instruction set architecture

Figure 4. WideWord instruction format
As shown in Figure 4, most DIVA WideWord instructions use a three-operand format to specify two 256-bit source registers and a 256-bit destination register. The opcode generally denotes a class of operations, such as arithmetic, and the function denotes a specific operation, such as add or subtract. The C bit indicates whether the operation performed by the instruction execution updates condition codes. The W field indicates the operand width, allowing WideWord data to be treated as a packed array of objects of eight, sixteen, or thirty-two bits in size. This characteristic means the WideWord ALU can be represented as a number of variable-width parallel ALUs. The P field indicates the participation mode, a form of selective subfield execution that depends on the state of local and neighboring condition codes. Under selective execution, only the results corresponding to the subfields that participate in the computation are written back, or committed, to the instruction's destination register. The subfields that participate in the conditional execution of a given instruction are derived from the condition codes or a mask register, plus the instruction's 2-bit participation field. For more details, see [2].
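
The effect of the W field and of selective execution can be illustrated with a small software model. This is only a sketch under stated assumptions, not the chip's logic: the function names are hypothetical, arithmetic is assumed to wrap around within each subfield, subfields are numbered from the least significant end, and the participation flags are passed in directly as booleans rather than derived from condition codes, a mask register, and the 2-bit P field.

```python
WORD_BITS = 256  # WideWord register width stated in the text


def split(word, width):
    """Split a 256-bit integer into subfields of `width` bits, least significant first."""
    mask = (1 << width) - 1
    return [(word >> (i * width)) & mask for i in range(WORD_BITS // width)]


def join(fields, width):
    """Pack a list of subfields (least significant first) back into one integer."""
    mask = (1 << width) - 1
    word = 0
    for i, field in enumerate(fields):
        word |= (field & mask) << (i * width)
    return word


def wide_add(dest, src_a, src_b, width, participate):
    """Packed add; only participating subfields are committed to the destination.

    `participate` is a hypothetical stand-in for the hardware participation state:
    one boolean per subfield.
    """
    if width not in (8, 16, 32):
        raise ValueError("operand width must be 8, 16, or 32 bits")
    if len(participate) != WORD_BITS // width:
        raise ValueError("one participation flag per subfield is required")
    mask = (1 << width) - 1
    a, b, old = split(src_a, width), split(src_b, width), split(dest, width)
    # Assumption: each subfield wraps independently, with no carry into its neighbor.
    out = [(x + y) & mask if p else o for x, y, o, p in zip(a, b, old, participate)]
    return join(out, width)
```

With 32-bit subfields the model behaves as eight parallel adders; with 8-bit subfields it behaves as thirty-two. Subfields that do not participate keep the previous contents of the destination register.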

The WideWord instruction set consists of roughly 30 instructions implementing typical arithmetic instructions like add, subtract, and multiply; logical functions like AND, OR, NOT, XOR; and logical/arithmetic shift operations. In addition, there are load/store and transfer instructions that provide for rich interactions between the scalar and WideWord datapaths.

Some special instructions include permutation, merge, and pack/unpack. The WideWord permutation network supports fast alignment and reorganization of data in wide registers. The permutation network enables any 8-bit data field of the source register to be moved into any 8-bit data field of the destination register. A permutation is specified by a permutation vector, which contains 32 indices corresponding to the 32 8-bit subfields of a WideWord destination register. A WideWord permutation instruction selects a permutation vector by either specifying an index into a small set of hard-wired commonly used permutations or a WideWord register whose contents are the desired permutation vector. The merge instruction allows a WideWord destination to be constructed from the intermixing of subfields from two source operands, where the source for each destination subfield is selected by a condition specified in the instruction. This merge instruction effects efficient sorting. The pack/unpack instructions allow the truncation/elevation of data types and are especially useful in pixel processing.
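
A byte-level model makes the permutation and merge semantics concrete. The sketch below is illustrative only: the function names are hypothetical, bytes are numbered from the least significant end, and merge is modeled at 8-bit granularity with the per-subfield condition supplied as a list of booleans rather than evaluated from condition codes.

```python
WORD_BYTES = 32  # a 256-bit WideWord register holds 32 8-bit subfields


def permute(src, vector):
    """Destination byte i receives source byte vector[i] (32 indices in 0..31)."""
    if len(vector) != WORD_BYTES or any(not 0 <= v < WORD_BYTES for v in vector):
        raise ValueError("a permutation vector holds 32 indices in the range 0..31")
    source = src.to_bytes(WORD_BYTES, "little")
    return int.from_bytes(bytes(source[v] for v in vector), "little")


def merge(src_a, src_b, select_a):
    """Build a destination from two sources; select_a[i] chooses src_a for byte i."""
    if len(select_a) != WORD_BYTES:
        raise ValueError("one selection flag per 8-bit subfield is required")
    a = src_a.to_bytes(WORD_BYTES, "little")
    b = src_b.to_bytes(WORD_BYTES, "little")
    return int.from_bytes(
        bytes(x if sel else y for x, y, sel in zip(a, b, select_a)), "little"
    )
```

Because an index may appear more than once in the vector, the same model covers broadcasts (all 32 indices equal) as well as reorderings such as the byte reversal or the shuffles needed for a matrix transpose.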

4.2. Pipeline description

The WideWord Processor pipeline is a standard DLX-like 5-stage pipeline, with the following stages: (1) instruction fetch; (2) decode and register read; (3) execute; (4) memory; and, (5) writeback. Data hazards occur when there are read-after-write register dependences between instructions that co-exist in the pipeline. The controller and datapath contain the necessary forwarding, or bypass, logic to allow pipeline execution to proceed without stalling in most data dependence cases. Register forwarding is complicated somewhat by the participation capability. Participation status must be forwarded along with each subfield to effect correct forwarding.
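
Why participation status has to travel with each subfield can be seen in a small model of operand bypassing. This is a sketch of the idea, not the implemented bypass logic: the `InFlight` record and `read_operand` function are hypothetical, and in-flight instructions are assumed to be listed from oldest to youngest.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class InFlight:
    """A not-yet-written-back result: destination register, subfields, participation."""
    dest: int
    fields: List[int]
    participation: List[bool]


def read_operand(reg: int, regfile: Dict[int, List[int]], inflight: List[InFlight]) -> List[int]:
    """Return the up-to-date subfields of `reg`, bypassing from in-flight producers.

    A producer only overrides the subfields it actually participates in; the
    remaining subfields must still come from an older producer or the register file.
    """
    value = list(regfile[reg])
    for instr in inflight:  # oldest first, so the youngest producer wins per subfield
        if instr.dest == reg:
            value = [
                new if p else old
                for old, new, p in zip(value, instr.fields, instr.participation)
            ]
    return value
```

Forwarding the whole 256-bit result of the youngest producer would be wrong whenever that producer skipped some subfields, which is the complication the text describes.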

5. Implementation and testing of the DIVA WideWord Processor

The DIVA WideWord Processor specification required on the order of 25,000 lines of VHDL code, consisting of a mix of RTL-level behavioral and gate-level structural code. A preliminary, unoptimized stand-alone layout of the WideWord Processor used 100,000 standard cells (approximately one million transistors) and occupied 10 sq mm in 0.18µm technology, projected to operate at 300MHz while dissipating 500mW.

Although the WideWord Processor is suitable for stand-alone implementations, the DIVA project employs it as part of a tightly integrated node design, as discussed in Section 3. The WideWord Processor VHDL specification was included as part of a DIVA PIM prototype specification, which was synthesized using Synopsys Design Analyzer. The entire chip was placed and routed with Cadence Silicon Ensemble, and physical verification, such as DRC and LVS, was performed with Mentor Calibre. The intellectual property building blocks used in the chip include Virage Logic SRAM, a NurLogic PLL clock multiplier, and Artisan standard cells, pads, and register files.

The first DIVA PIM prototype, shown in Figure 5, is a single-node implementation of the DIVA PIM chip architecture. Due to challenges in gaining access to embedded DRAM fabrication lines, this first prototype is SRAM-based. This chip implements all features of the DIVA PIM architecture except address translation and floating-point capabilities. A second version of a PIM chip, which not only integrates these functions but achieves a faster clock rate, is due to tape out in the second half of 2002. The chip shown in Figure 5 was fabricated through MOSIS in TSMC 0.18µm technology, and the silicon die measures 9.8mm on a side. It contains approximately 2 million logic transistors in addition to the 53 million transistors that implement 8 Mbits of SRAM. The chip also contains 252 pads, of which 240 are signal I/O, and is packaged in a 35mm TBGA.


Figure  5.  DIVA  PIM  prototype  chip 

The chip is being tested with the use of an HP 16702A logic analysis system. Pattern generator modules are utilized to apply test vectors to the inputs of the chip, and timing/state capture modules are used to sense the outputs of the chip. The chip is currently being tested for functionality at a testbench speed of 80MHz. Although exhaustive testing has not yet been completed, the chip is running a demonstration application of matrix transpose that exercises all major control and datapaths within the scalar processor, including the permutation network highlighted in Section 4.1. Even in this limited test setup, the chip is achieving 640MOPS and 2.56Gbytes/s memory bandwidth while dissipating only 800mW. We anticipate even greater achievements with further testing.
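
The reported figures are consistent with a simple back-of-the-envelope calculation. The sketch below assumes, as an interpretation rather than a statement from the test report, that one 256-bit memory access and one WideWord operation on eight 32-bit subfields complete per cycle at the 80MHz testbench clock.

```python
CLOCK_HZ = 80_000_000   # testbench speed stated in the text
DATAPATH_BITS = 256     # memory data bus and WideWord register width
OPERAND_BITS = 32       # assumed subfield width for the operation count

# One 256-bit transfer per cycle (assumption) gives the memory bandwidth.
bytes_per_cycle = DATAPATH_BITS // 8
memory_bandwidth_bytes_per_s = CLOCK_HZ * bytes_per_cycle

# One WideWord instruction per cycle (assumption) acting on eight 32-bit subfields.
ops_per_cycle = DATAPATH_BITS // OPERAND_BITS
ops_per_s = CLOCK_HZ * ops_per_cycle

print(f"{memory_bandwidth_bytes_per_s / 1e9:.2f} Gbytes/s, {ops_per_s / 1e6:.0f} MOPS")
```

Under these assumptions both quantities scale linearly with the clock, so a faster testbench or production clock would raise them proportionally.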

6.  Conclusion 

This paper has presented the design and implementation of the WideWord Processor used in the DIVA system, an integrated hardware and software architecture for exploiting the bandwidth of PIM-based systems. A working implementation of this design, based on TSMC 0.18µm technology, has proven the validity of the design. The workstation system that is currently being developed to use this component is projected to achieve speedups ranging from 8.8 to 38.3 over conventional workstations for a number of applications.
These improvements arise mainly from three sources: decreased memory times; coarse-grain parallelism across PIMs to exploit system bandwidth; and, wide on-chip datapaths to exploit fine-grain parallelism, including especially those wide datapaths within the WideWord Processor.

Abstract. The DIVA (Data IntensiVe Architecture) system incorporates Processing-In-Memory (PIM) chips as smart-memory coprocessors to a host microprocessor. It exploits the inherently high on-chip memory bandwidth and additionally provides a separate memory-to-memory high-bandwidth interconnect across the system. By design, the DIVA system architecture targets a broad range of applications, including those with irregular data access patterns. At the same time, DIVA supports familiar programming paradigms from parallel computing and offers an evolutionary migration path for application development.

The DIVA project is constructing a demonstration system using a conventional superscalar host processor with a main memory composed of VLSI PIM chips in place of standard DRAMs. This system has a novel range of operating-system challenges, combining aspects of conventional "dumb" memory management and both shared- and distributed-memory multiprocessor operations. This paper describes our solutions to the memory-management problems posed by this multifaceted environment.


1  Introduction 

The Data IntensiVe Architecture (DIVA) project is building a workstation-class system using embedded-DRAM technology to replace the memory system of a conventional workstation with "smart memories" capable of very large amounts of processing. The goal of the project is to significantly reduce the ever-increasing processor-memory bandwidth bottleneck in conventional systems. System bandwidth limitations are thus overcome in three ways, as illustrated in Figure 1: (1) tight coupling of a single PIM processor with an on-chip memory bank; (2) distributing multiple processor-memory nodes per PIM chip; and, (3) utilizing a separate chip-to-chip interconnect, for direct communication between nodes on different chips that bypasses the host system bus.

Appeared in "Workshop on Intelligent Memory Systems," November, 2000

Fig. 1. System Architecture.


This paper describes memory management in DIVA. Two aspects of the DIVA project distinguish its memory management requirements from that of other PIM-based architectures.

— The PIMs serve as the only memory for a standard host microprocessor, assuming the dual roles of "smart memories" and conventional memory.

— DIVA targets applications that are most severely impacted by the processor-memory bottlenecks in conventional systems: sparse-matrix and pointer-based applications with irregular memory access patterns, and image and video applications with large working sets.

As compared to system-on-a-chip solutions [6,5], and multiprocessors made up solely of PIM chips [7,4], DIVA's support for conventional memory accesses from an external host requires a dual view of memory, from the host perspective and the PIM's perspective. Other PIM architectures address this challenge by restricting PIM functionality to SIMD execution on large streams of data, at the host's direction [1,2]. In DIVA, we support a much broader range of programming paradigms including task-level parallelism and in-memory accesses to pointer data structures. As a result, DIVA requires a memory model that supports independent threads of control and efficient translation in memory, without necessitating host intervention.

A previous paper presents an overview of the DIVA project and describes a memory model to support these requirements [3]. This paper discusses the memory management support needed to realize this memory model.

2 Overview of Memory Model and Address Translation

The DIVA memory model attempts to satisfy a number of potentially conflicting requirements:

— scalability to tens of PIM chips;

— efficient hardware mechanisms amenable to straightforward implementation;

— an abstract machine comprehensible to both programmers and compiler writers;

— compatibility with conventional memory models and memory interfaces;

— support for virtual memory (i.e., paging to/from disk) and swapping; and,

— supporting correct functionality of both shared- and distributed-memory programming models.

The unifying concept for the DIVA memory model is communication via a global address space shared by the host processor and all the PIM node processors. Not all memory need be shared, however, so our hardware supports local PIM address spaces as well. All of the processors in our demonstration system use 32-bit addresses, but the model can be advantageously extended to future 64-bit systems.

To interpret addresses in PIM code and data, a PIM processor must support a translation mechanism. However, the space and time overhead of maintaining conventional page tables at each node is prohibitive. To simplify translation hardware, we classify DIVA memory according to usage:

— global memory is a single address space distributed across nodes, visible to applications running on the host and PIM nodes.

— dumb memory is a region of a node's memory allocated as conventional pages in a host application's virtual space and untouched by PIM node processing.

— local memory is a region of a node's memory used almost exclusively by PIM routines. Certain exceptional functions of the host operating system, such as initialization and context management, will also access this memory occasionally, requiring well-defined data-sharing conventions.

The physical memory on each PIM chip is flexibly partitioned into these three distinct uses. Dumb memory is managed exclusively by the host operating system in standard ways, with address translation handled solely by the host processor's memory-management hardware. Figure 2 depicts the two more interesting uses of PIM memory, as part of the shared global address space, or as PIM local memory. The DRAM memory associated with the global address space is physically distributed across all the PIM nodes involved in a computation. Addresses in the global virtual address space are consistent for the host and all PIM node processors, so that pointer-based data structures can be freely shared. In contrast, while the host has physical access to the PIM DRAM used as local memory, the host and PIM node processors will see it at different virtual addresses.

Fig. 2. Shared global segments, unshared local segments.


The host processor can access PIM memory via its memory bus. To avoid saturation of this bus, PIM-to-PIM communications occur primarily by means of a distinct high-bandwidth network between PIM chips. The hardware directly supports shared-memory operations between the host and PIM memories, but PIM-to-PIM communications are implemented by network communications in the form of parcels (Section 2.3). Parcel operations are hardware assisted, but require software processing by either user- or supervisor-level code at both ends. Efficient network interface and interrupt mechanisms have been developed to support parcel functions.

Fig. 3. PIM node processor address map.

2.1 Address Translation for Locally Mapped Data

A node must be able to rapidly determine if an address is located in its own memory, and if so, find the physical address. Each node therefore maintains translations for virtual addresses currently residing on it, including local memory and its portion of global memory. To condense translation information, we use segments, each of which is defined by segment registers containing a physical base address and limit. The segments are described by the PIM address map shown in Figure 3. The local memory region is partitioned into eight segments at fixed virtual bases, for kernel code, stack and data, user code and data/stack, and for kernel and user network-communication buffers. A small number of global segment registers are also used; since global segments must be able to map portions of a shared virtual address space much larger than the physical memory of an individual node, global segments must be represented by both a virtual and physical base address register.
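
Segment-based translation of this kind reduces to a base-and-limit check per segment. The following sketch is illustrative only: the `Segment` record, the `translate` function, and all addresses used with it are hypothetical, and the hardware would compare against its segment registers in parallel rather than loop over them.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Segment:
    """One segment: a virtual base, a physical base, and a limit (size in bytes)."""
    virtual_base: int
    physical_base: int
    limit: int


def translate(vaddr: int, segments: Iterable[Segment]) -> Optional[int]:
    """Return the physical address for `vaddr`, or None if no local segment maps it."""
    for seg in segments:
        offset = vaddr - seg.virtual_base
        if 0 <= offset < seg.limit:
            return seg.physical_base + offset
    return None  # not resident on this node; a remote access would be required
```

For local segments the virtual base is fixed by the address map, so only the physical base and limit need to be stored; global segments carry both bases, as the text notes. A miss in every segment is the cheap signal that the address is not mapped on this node.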

Figure 3 shows the virtual address map for a PIM node processor. The virtual size of each region of the map is shown on the left; typically only a fraction of that virtual address space will be used, as noted on the right of the diagram. The lowest-addressed (bottommost) segments of the address map define the local-memory user-mode space for the current process. The next group of segments defines the local-memory supervisor-mode space for the kernel, which is the same for all processes. The kernel can also access the PIM's DRAM without address translation via the physical space region.

Each PIM has a small number of relocation registers to allow it to map portions of the shared global address space to the node's physical memory. The aggregate size of these "windows" into the shared address space is limited by the amount of physical memory available on a node. The top region of the map is unused, but reserved to conform with the host operating system's address map.

2.2 Translating Remote Addresses

Access to parts of the global address space not mapped to physical memory on the node is possible via the network, but less efficient than a mapped access. A parcel must be sent to the node which contains the physical memory to be accessed, and a response parcel received and processed, either by user or kernel code.

DIVA determines the location of remote data via a two-stage process. The virtual address of a datum is hashed by hardware to determine the "home node" of the datum [8]. The home node may or may not be the present physical location of the datum, but serves as the centralized directory and manager for it. The home node will either perform the operation itself, if the datum is resident, or forward the request to another node, if the datum resides elsewhere.
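
The two-stage lookup can be sketched in a few lines. Everything specific here is an assumption made for illustration: the text does not give the hardware hash, so `home_node` uses a simple block-number modulo, and the per-node directory is modeled as a dictionary from block number to the node that currently holds the block.

```python
from typing import Dict, List

BLOCK_BITS = 12  # hypothetical granularity at which data is assigned a home node


def home_node(vaddr: int, num_nodes: int) -> int:
    """Stage 1: hash the virtual address to a home node (stand-in hash function)."""
    return (vaddr >> BLOCK_BITS) % num_nodes


def route(vaddr: int, num_nodes: int, directories: Dict[int, Dict[int, int]]) -> List[int]:
    """Stage 2: return the nodes a request visits, home node first.

    `directories[n]` is node n's hypothetical directory of blocks it is home for
    but that currently reside on another node.
    """
    home = home_node(vaddr, num_nodes)
    holder = directories.get(home, {}).get(vaddr >> BLOCK_BITS, home)
    return [home] if holder == home else [home, holder]
```

The requesting node never needs a system-wide translation table: it computes the home node locally, and only the home node has to know where the datum currently lives.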

Therefore, a node must maintain translation information for only eight local segments plus a small number of segments for its portion of the global memory, as well as for any global data for which it is the home node. The major advantages of this approach are that translation can be accomplished rapidly, and translation information on each PIM scales well.

2.3 Parcels

All PIM-to-PIM network communications are performed by sending and receiving messages in the form of parcels. Parcels are an object-based variant of active messages [9], distinguished from active messages in that the destination of a parcel is an object in memory, not a specific node. From a programmer's view, parcels, together with the global address space supported in DIVA, provide a compromise between the ease of programming a shared-memory system and the architectural simplicity of pure message passing. Remote operations or accesses can be accomplished through parcel sends and receives; application programs only need specify the address of an object, and not the processor upon which the object resides.

Structurally, a parcel packet has a 256-bit payload and 96-bit header, which includes:

— the memory address of the target object, expressed in the application's address space.

— the environment id (eid) of the process in the host that is executing the application.

— a command identifying the function to be performed by the node associated with the target when the parcel arrives.
The 256-bit payload serves as arguments to the command. A parcel requiring more bits must be sent in multiple packets. The payload size matches the PIM node data bus width; streaming packets may be sent in a single bus cycle. The network interface supports both user- and supervisor-mode access to parcel sending and receiving hardware via the user and kernel parcel buffer segments. User-mode parcel processing is more efficient but less robust than kernel-mediated operations, so will typically be restricted to compiler-generated code or library routines. Error conditions cause invocation of either user- or supervisor-mode handlers.
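
The packet structure described above can be modeled as a small record plus a routine that splits long argument lists across packets. This is a sketch under assumptions: the names `ParcelPacket` and `packetize` are hypothetical, the widths of the individual header fields are not modeled because the text gives only the 96-bit total, and repeating the full header on every packet with zero padding of the final payload is assumed for simplicity.

```python
from dataclasses import dataclass
from typing import List

PAYLOAD_BYTES = 32  # 256-bit payload, matching the PIM node data bus width


@dataclass(frozen=True)
class ParcelPacket:
    """One parcel packet: the three header fields named in the text plus a payload."""
    target_address: int  # address of the target object in the application's space
    eid: int             # environment id of the host process running the application
    command: int         # identifies the function to perform at the destination
    payload: bytes       # exactly 32 bytes of arguments


def packetize(target_address: int, eid: int, command: int, args: bytes) -> List[ParcelPacket]:
    """Split `args` into as many 256-bit payloads as needed (at least one packet)."""
    chunks = [args[i:i + PAYLOAD_BYTES] for i in range(0, len(args), PAYLOAD_BYTES)] or [b""]
    return [
        ParcelPacket(target_address, eid, command, chunk.ljust(PAYLOAD_BYTES, b"\0"))
        for chunk in chunks
    ]
```

Note that the packet names an object address rather than a destination node, which matches the object-based addressing that distinguishes parcels from active messages.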


3 Overview of Memory Management

Memory management functions are divided between two types of kernels in the DIVA system. On the host processor, the standard operating system (in DIVA, Linux) is augmented with functionality to support PIMs. On each PIM processor, there is a tiny run-time kernel that is always resident. A primary responsibility of the PIM run-time kernel is to manage parcel communication between PIMs [3]. The run-time kernel performs buffer management of incoming parcels, and directs context switches between different threads in the same user program, or between user program and kernel. The run-time kernel also performs required software intervention in response to interrupts and exceptions on the PIM processor. In addition to these autonomous functions of the PIM run-time kernel, it also must collaborate with the host on system-level operations, such as loading PIM programs and data, memory management of PIM-visible segments, and PIM context switches between different user programs (note that most host context switches will not involve the PIMs).

This division of labor is motivated by the dual goals of keeping the PIM run-time kernel as small as possible, and making only moderate changes to the standard function of the host operating system. Unlike standard multiprocessor systems, the host, which has a system-level view, remains a central figure in system-level scheduling, disk I/O operations, and memory management. The challenge in this collaboration between host and PIM system software is that there are really two views of memory that must be maintained. For dumb pages and for disk I/O of PIM-visible segments, the host sees memory as standard 4 Kbyte pages; the PIM run-time kernel instead views PIM-visible memory as variable-size segments. Reconciling these two views through different system functions is the subject of the remainder of this paper.

4 Memory Allocation: Virtual vs. Physical

The portion of memory used by the host as dumb memory is managed by the host operating system using standard allocation, paging and swapping mechanisms. The memory devoted to PIM local memory, and global shared memory, must be managed via a collaboration between host and PIMs. The most unusual aspect of this collaboration is memory allocation.

Figure 4 shows the functions associated with memory allocation, and whether
they are performed by host or PIM. There are three phases to allocation: (1)
host allocation of contiguous virtual address spaces for global and PIM local
segments using the Reserve functions; (2) physical allocation of an object and
binding to reserved virtual segments; and, (3) mapping of existing global objects
to a global segment for sharing between PIMs. Deallocation (GlobalFree) frees
physical memory but does not shrink the virtual-space allocation.

The standard memory allocation functions malloc and free can be used on
either the host or PIMs; the meaning depends on where the functions are
executed. On the host, a call to malloc performs a standard allocation from dumb
memory. On the PIMs, it allocates memory from the PIM's local heap segment.
Memory obtained from malloc is private to a process and unsharable.

4.1 Virtual Memory Allocation

Using a segmented approach requires that data in a segment reside in contiguous
virtual addresses. For this reason, as part of the allocation process, we must
reserve a contiguous chunk of the virtual address space for each segment prior to
physical allocation. The virtual memory allocation is performed by the host using
the Reserve functions for global and local segments. Because the virtual address
space is quite large, these reservations should always strive to overestimate the
space requirements of the segment, particularly since growing a segment beyond
what was initially reserved results in very costly adjustments in virtual and
physical allocations. Linux supports this reservation process by clustering free
pages in the virtual address space together; reservations select a cluster that
matches the requested size.

Fig. 4. Host and PIM memory management functions and steps of memory allocation.
(Figure boxes: Allocate Virtual Address Space (Host Only); Allocate Physical ...; ... Existing Object (Host or PIM).)

Multiple global segments can be reserved by separate invocations to the
ReserveGlobalSegment function. Each global segment reservation will create a new
segment with a unique name (see Section 7.2). The segment name can
subsequently be used optionally by allocation functions, as discussed below. Similar
functions exist to allocate the virtual address space for PIM-local code, data
and stack segments. These are italicized to indicate that they are optional, since
standard default values often suffice.
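
The reservation step of Section 4.1 can be sketched as a small model. This is only an illustration under stated assumptions: the class, its method names, and the best-fit cluster selection are hypothetical stand-ins for the Reserve functions and for the Linux free-cluster mechanism, whose actual interfaces are not given in the text.

```python
PAGE = 4096  # 4Kbyte host page size, as stated in the text


def round_up(n, unit=PAGE):
    """Round n up to a multiple of unit."""
    return -(-n // unit) * unit


class VirtualSpace:
    """Toy model: the virtual address space as a list of free clusters."""

    def __init__(self, free_clusters):
        # Each cluster is (base_address, length_in_bytes).
        self.free = sorted(free_clusters)
        self.segments = {}   # segment name -> (base, length)
        self._next_name = 0  # names are small integers here (assumption)

    def reserve_global_segment(self, nbytes):
        """Reserve a contiguous range and return a new unique segment name."""
        need = round_up(nbytes)
        fits = [c for c in self.free if c[1] >= need]
        if not fits:
            raise MemoryError("no free cluster large enough")
        base, length = min(fits, key=lambda c: c[1])  # smallest cluster that fits
        self.free.remove((base, length))
        if length > need:
            # Return the unused tail of the cluster to the free list.
            self.free.append((base + need, length - need))
            self.free.sort()
        name = self._next_name
        self._next_name += 1
        self.segments[name] = (base, need)
        return name
```

Because growing a reservation later is costly, a caller of such a function would pass a deliberately generous size; the model itself only rounds up to whole pages.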

4.2 Physical Memory Allocation

The physical allocation is performed through a collaboration between host and
PIM. As part of physical allocation, the page table entries on the host are filled.
On the PIM side, segment registers may be updated.

The functions shown in Figure 4 allocate a specific object to a segment.
However, global and per-node local memory allocation and deallocation could
swamp the host operating system with fine-grained memory allocation requests.
Behind the scenes, we distribute this task using a two-level scheme where coarser-
grained requests to the host are made by each PIM run-time kernel to replenish
locally managed memory pools of preallocated global and local memory. This
approach keeps the host involved in memory allocation, but still permits the
PIMs to allocate memory independently as needed for managing pointer-based
and other dynamic structures during PIM computation.
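
The two-level scheme can be illustrated with a small simulation. All names and the 64 Kbyte refill size are assumptions chosen for the sketch; the text does not specify the chunk size or the interface between the PIM run-time kernel and the host.

```python
class Host:
    """Stands in for the host OS; hands out coarse-grained chunks."""

    def __init__(self, total_bytes):
        self.remaining = total_bytes
        self.requests = 0  # how many times a PIM kernel had to ask the host

    def grant_chunk(self, nbytes):
        if nbytes > self.remaining:
            raise MemoryError("host pool exhausted")
        self.remaining -= nbytes
        self.requests += 1
        return nbytes


class PimKernelPool:
    """Locally managed pool on one PIM node (hypothetical model)."""

    def __init__(self, host, chunk_bytes=64 * 1024):
        self.host = host
        self.chunk_bytes = chunk_bytes  # assumed refill granularity
        self.available = 0

    def alloc(self, nbytes):
        """Serve a fine-grained request, refilling from the host only if needed."""
        if self.available < nbytes:
            self.available += self.host.grant_chunk(max(self.chunk_bytes, nbytes))
        self.available -= nbytes
        return nbytes
```

In this model a thousand 64-byte allocations cost the host a single request, which is the effect the two-level scheme is meant to achieve.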

The DIVA programming model offers a globally addressable, distributed
address space on shared data. PIM applications perform correctly when accessing
non-local memory, either by communicating via the parcel mechanism, or by
retrieving data in response to a more expensive address translation fault.
Nevertheless, just as with distributed-shared-memory architectures, to achieve the
best performance, an application must whenever possible co-locate data with
the computation that accesses it. For this purpose, there are two flavors of
memory allocation functions. The GlobalMallocToNode function associates allocated
data to a specific virtual PIM home node. An optional segmentName argument
permits this allocation to occur within a specific global segment. To allow two
related objects to be collocated without requiring the virtual PIM identifier, the
GlobalMallocToAddress function instead permits dynamic allocation of objects
to the same virtual PIM node and global segment upon which another datum
resides.

To simplify the programming model, GlobalMalloc functions performed on
the PIM match the interface used on the host. Most of the time these functions
will be used to allocate data from the PIM's locally resident global segments,
but it is possible for a PIM to perform an allocation on a remote PIM node. The
effect of such an allocation is to allocate virtual addresses from the remote node,
and locally map the object to the requesting PIM's global segments and physical
storage. Such an allocation can be performed to support efficient updates of the
remote data prior to forwarding them to the remote node (see Section 7.3).




4.3 Mapping Existing Objects to Global Segments

Like the GlobalMalloc to a remote node, it is sometimes desirable to temporarily
map non-resident global data to facilitate sharing among PIMs. The GlobalMap
function performs this mapping to global segment registers, and GlobalUnMap
returns the data to its home node (see Section 7.3).

5 Paging

To perform computations requiring access to global data structures larger than
the actual amount of physical memory, we support a virtual-memory "paging"
mechanism for PIM-process memory. (We use the slightly inaccurate term
"paging" in preference to the technically preferable term "global-segment access fault
management" for brevity.) If the memory access which caused the fault
references a datum resident in a PIM memory, it can be resolved without troubling
the host operating system, by retrieving the accessed datum via a parcel request
to its home node. On the other hand, if the home node returns a message
indicating that the requested datum is not resident in the system's physical memory,
the initiating PIM kernel must request paging service from the host, which is
connected to the disk backing store.


Fig. 5. Paging sequence. (Figure columns: Host Action; PIM Action.)


As shown in Figure 5, the faulting PIM process is suspended until the datum
is paged in from disk, and the host kernel becomes the owner of that process
context during the memory reorganization. The host pages in a section of the
global virtual space containing the datum. The paged-in section is typically
mapped to a distinct segment of the global space. However, if the faulting address
is adjacent to an existing locally mapped segment, the segment may be extended
to contain it.

After the host kernel has resolved the fault and adjusted the faulting process'
PIM context mappings, it returns ownership of the context to the PIM kernel,
which reloads the PIM address translation hardware when it reactivates the
process.

The paging system is a useful but relatively expensive feature, best used
sparingly.
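
The two fault-resolution paths described in this section can be sketched as follows. This is a simplified model, not the DIVA implementation: the dictionary-based context, the `home_of` callback, and the trace strings are all hypothetical, and extension of an adjacent segment is not modeled.

```python
def new_context():
    """A minimal process context for the sketch (hypothetical fields)."""
    return {"state": "running", "owner": "pim", "mapped": {}, "trace": []}


def handle_access_fault(ctx, addr, home_of, resident, disk):
    """Resolve a global-segment access fault on addr and return the datum.

    resident maps node id -> {address: value} for data held in PIM memory;
    disk maps address -> value for data only on the backing store.
    """
    home = home_of(addr)
    if addr in resident[home]:
        # Cheap path: the datum is in some PIM memory, so a parcel request
        # to its home node suffices and the host is not involved.
        ctx["trace"].append("parcel request to node %d" % home)
        return resident[home][addr]
    # Expensive path: the home node reports "not resident".
    ctx["state"], ctx["owner"] = "suspended", "host"
    ctx["trace"].append("host pages in from disk")
    value = disk[addr]
    ctx["mapped"][addr] = value  # host adjusts the context's mappings
    ctx["owner"], ctx["state"] = "pim", "running"
    ctx["trace"].append("PIM kernel reloads translation hardware")
    return value
```

The ownership field changes hands only on the expensive path, which mirrors why the text recommends using paging sparingly.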

6  Contexts  and  Swapping 

A DIVA PIM node supports very efficient context switching for the most common
cases, either switching between a user program and the PIM run-time kernel, or
between two distinct threads within the same user program. Switching to the
run-time kernel requires no change to segment registers, and requires minimal
saving and restoring of register state. Switching between different threads in the
same user program, such as when performing the command associated with an
incoming parcel, requires modification to only two of the segment registers, but
does require saving and restoring of portions of the register state. In either case,
there is no need to swap memory in or out in response to a context switch.

In performing its normal job scheduling function, the host may direct the
PIM node's context to change to a different user program that requires PIM
functionality. In this case, a full context switch is necessary, saving all the
register state as well as updating the program-specific segment registers. Further,
memory may need to be swapped in or out. If the user-code physical memory is
swapped out and recycled, the content of the PIM node processor's instruction
cache must also be invalidated by software, since the new program's code
memory may overlap with the previous program's. (Our processor, like many others,
does not enforce coherency in the instruction cache hardware.) Note that at any
time, the host may be executing in a different context from one or all PIMs;
for most host context switches, it will not be necessary to change the PIM node
context.

The host operating system is responsible for creating contexts for the PIMs,
and also for updating contexts in response to major system context switches.
(Lightweight PIM context switches, e.g., multithreading, do not involve the
host.) To facilitate host management of contexts, during initialization the host
creates a data structure, mapped to the PIM's memory, that it shares with the
PIM run-time kernel. While it is possible for the host to build this data
structure through a series of parcels sent to the PIM run-time kernel, for efficiency
we permit the host operating system in privileged mode to write directly into
portions of the PIM run-time kernel data segment. The host also updates this
structure in response to a system context switch.

6.1 Contents of Context


Figure 6 graphically depicts the contents of a context. On initialization of a
user program, the host performs virtual memory allocation of the segments, as
discussed in Section 4.1, and writes the range of allocated virtual addresses into
the context data structure in the PIM run-time kernel segment. The remainder of
the context is reset to default values. As a result of physical memory allocations,
the physical segment mappings are added to this structure. The remainder of
the fields are filled in by the PIM run-time kernel when saving state as a result
of a context switch. When this context is restored, the host updates the segment
mappings as needed.


Local Segment Mappings       Code, stack, local heap
Global Segment Mappings      Shared data windows
Scalar Register Set          32 32b-wide entries
WideWord Register Set        32 256b-wide entries
Scalar Floating-Point Set    32 64b-wide entries
Condition Codes, etc.        Scalar & WideWord
Parcel Buffer State          Network interface state

Fig. 6. Contents of context.
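
As a rough illustration of what a full context switch has to save, the rows of Figure 6 can be written down as a record. The field names and Python types below are assumptions made for this sketch; only the entry counts and widths are taken from the figure (with the scalar register count read as 32).

```python
from dataclasses import dataclass, field


@dataclass
class PimContext:
    """Hypothetical layout mirroring the rows of Figure 6."""
    local_segments: dict = field(default_factory=dict)   # code, stack, local heap
    global_segments: dict = field(default_factory=dict)  # shared data windows
    scalar_regs: list = field(default_factory=lambda: [0] * 32)    # 32 bits each
    wideword_regs: list = field(default_factory=lambda: [0] * 32)  # 256 bits each
    fp_regs: list = field(default_factory=lambda: [0.0] * 32)      # 64 bits each
    condition_codes: int = 0
    parcel_buffer_state: bytes = b""


# Width in bits of one entry in each register file, per Figure 6.
REG_BITS = {"scalar_regs": 32, "wideword_regs": 256, "fp_regs": 64}


def register_state_bytes(ctx):
    """Bytes needed to save the three register files of a context."""
    return sum(len(getattr(ctx, name)) * bits // 8
               for name, bits in REG_BITS.items())
```

Under these figures the three register files together occupy 128 + 1024 + 256 bytes, so the WideWord registers dominate the cost of saving register state.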


6.2 Swapping

Swapping is another mechanism for supporting computations with large memory
requirements. Many computations can be broken up into distinct phases which
need not be simultaneously active. Peak memory requirements may be reduced
by swapping out inactive processes or low-priority active processes. Swapping is
somewhat similar to paging; the primary distinctions are that the entire context
is moved to the disk backing store, freeing all the process memory, and that the
host operating system, rather than the PIM kernel, initiates the swap as part of
its overall scheduling function.

The sequence of actions required to effect a process swap and restore is
sketched in Figure 7. As in the paging sequence, the ownership of the process
context and its associated resources passes from the PIM kernel to the host
operating system when the PIM process is suspended. Restoring a process context
from an out-swapped state is quite similar to the initial instantiation of the
process.

Swapping out a context frees most local process resources for reuse, but does
not free memory used to store global segments, since they are likely to be in use
by related processes on other PIM nodes. The global memory use may well exceed
the local, so this is a potentially major problem for our storage reclamation
capabilities. To be able to effectively manage groups of related processes, and
to be able to decide when their associated global segments may be swapped
out, we adopt a system of naming global segments, discussed in Section 7.2.
We can thereby "gang schedule" related processes which use particular global
segments and swap out their local and global resources together. We can record
process references to a given global segment by explicit segment mappings and
by remote parcel accesses. Statistics such as these can be recorded by each PIM
node kernel and stored locally. The host operating system will only need to
examine and aggregate these distributed runtime statistics in the event of a
requirement for a major swapping operation, such as phase transition for a very
large computation.

In general, the host must attend to global changes in the PIM-based
computation, i.e., scheduling functions, where resource allocation policies may be
altered, and to chores which require access to external devices, such as
swapping or paging to disk. We minimize the host's workload, and its potential for
saturation, by requiring it to perform only those tasks which are global in their
essence.


7 Local and Global Segments


Section 4 described how local and global segments are allocated; here we consider
how they are managed. Local segments should remain fairly small, so there is
little concern that portions of them will be paged to disk during active PIM
execution. Rather, we assume that most of the data read or written by PIM
computations will reside in global segments.

Global segments provide a mechanism for sharing global data between host
and PIM or across PIMs. Data-intensive applications will have a large amount of
global data that can easily exceed the available physical memory capacity; thus,
it is desirable to break up global data into multiple global segments. Global
segments can be much larger than the 4Kbyte page size of the system, and there
can be many more global segments associated with a user program than are
mapped to the small set of global segment registers on each PIM. As a result,
data required by a portion of the computation of the PIM program may be
spread across multiple global segments; to avoid thrashing, care must be taken
to map these segments to physical memory simultaneously during this portion
of the computation. The remainder of this section describes the mechanisms for
organizing and managing data in multiple or very large global segments.

Fig. 7. Swapping sequence. (Figure columns: Host Action; PIM Action.)

7.1 Large Global Segments

In the absence of compiler or application-level support for defining segments,
the operating system's default behavior is to create one or a small number of
possibly very large global segments for a user application program. In this case, a
single segment can be much larger than the available physical memory capacity,
so that only a portion of the segment can reside in memory at a time.

Since each PIM has multiple global segment registers, even a single global
segment can be managed as multiple segments by having distinct segment
registers mapping different portions of the segment. This approach works well, for
example, if an application is streaming through its data set sequentially. As it
completes its accesses to data represented by one segment register, it can move
on to data represented by another segment register. The operating system and
PIM run-time kernel can page out the data associated with the former segment
registers, and reclaim the segment registers and physical memory for a
subsequent portion of the segment.
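
The streaming case can be sketched as a simple simulation in which a few segment registers act as windows that slide over one large segment. The function, its parameters, and the round-robin choice of register are assumptions for illustration; the text does not prescribe a replacement policy.

```python
def stream(segment_bytes, window_bytes, num_registers, stride):
    """Walk a segment sequentially and return the list of remaps performed.

    Each remap is (register_index, window_index): the register is pointed at
    a new window after the data of its previous window has been paged out.
    """
    registers = [None] * num_registers  # window currently held by each register
    remaps = []
    for offset in range(0, segment_bytes, stride):
        window = offset // window_bytes
        reg = window % num_registers  # assumed round-robin register reuse
        if registers[reg] != window:
            registers[reg] = window   # reclaim register and physical memory
            remaps.append((reg, window))
    return remaps
```

For a sequential walk, each window is mapped exactly once, so the number of remaps equals the number of windows regardless of how many accesses fall inside each one.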

7.2 Assigning Names to Global Segments

While a single large segment can be managed effectively for streaming
applications, in general, a more flexible mechanism is required for organizing data
into multiple segments. For example, an application may revisit data in different
phases of a computation; or, one data structure may be needed at the same time
as another data structure in one phase of computation, and also required at the
same time as a third data structure during a later phase of computation.

Our approach is to assign names to segments as they are being created, and
permit the compiler or application program to optionally reference these segment
names in memory allocation functions. For example, the effect of the allocation
function GlobalMallocToNode(int numBytes, int virtualNode, int segmentName)
is to allocate numBytes from the named segment segmentName on virtual PIM
node virtualNode. (The effect of a GlobalMallocToAddress call is to perform the
allocation on the same virtual PIM node and in the same global segment as
that of the specified address.) By allocating two objects from the same global
segment that are always used together, we can maximize the likelihood they
will always be simultaneously in memory whenever they are being accessed. In
cases where grouping all related data would result in too large a segment, the
related data must be broken into multiple smaller segments, such that their size
more manageably maps to physical memory, but at the same time, there are
sufficiently few related segments so that all can simultaneously map to the small
number of global segment registers on each PIM.
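
The relationship between the two allocation flavors can be illustrated with a toy allocator. This is a sketch under assumptions: the class, the bump-pointer policy, the fixed 1 Mbyte reservation per (node, segment) pair, and the starting address are all invented for the example; only the argument order of GlobalMallocToNode follows the text.

```python
class GlobalAllocator:
    """Toy model of named global segments spread over virtual PIM nodes."""

    def __init__(self, segment_bytes=1 << 20):
        self.segment_bytes = segment_bytes  # assumed reservation per segment
        self.bases = {}   # (node, segment) -> base virtual address
        self.used = {}    # (node, segment) -> bytes handed out so far
        self._next_base = 0x4000_0000  # arbitrary start of the global space

    def _region(self, node, segment):
        key = (node, segment)
        if key not in self.bases:
            self.bases[key] = self._next_base
            self.used[key] = 0
            self._next_base += self.segment_bytes
        return key

    def malloc_to_node(self, num_bytes, virtual_node, segment_name=0):
        """Models GlobalMallocToNode(numBytes, virtualNode, segmentName)."""
        key = self._region(virtual_node, segment_name)
        if self.used[key] + num_bytes > self.segment_bytes:
            raise MemoryError("segment reservation exceeded")
        addr = self.bases[key] + self.used[key]
        self.used[key] += num_bytes
        return addr

    def locate(self, addr):
        """Return the (node, segment) pair that owns a virtual address."""
        for key, base in self.bases.items():
            if base <= addr < base + self.segment_bytes:
                return key
        raise KeyError("address not in any global segment")

    def malloc_to_address(self, num_bytes, addr):
        """Models GlobalMallocToAddress: co-locate with an existing datum."""
        node, segment = self.locate(addr)
        return self.malloc_to_node(num_bytes, node, segment)
```

The second flavor needs no node identifier from the caller: it derives the node and segment from the address of the related object, which is what lets linked structures stay together.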


7.3 Sharing Global Segments across PIMs

As noted above (Section 4.2 and Section 4.3), mapping a global segment to local
physical memory provides a mechanism for efficient sharing of large blocks of
global data by asserting temporary ownership of a local copy of a data block
that may be "homed" on another node. The home node of a datum in the
global address space is a function of its virtual address, but the item may reside
on another node. The home node provides a central access point for the item
regardless of its actual location. In the absence of active mappings by other
nodes, a datum will be (re)allocated to its home node.

The shared data block is created by either the host or a PIM node via the
GlobalMalloc functions. The GlobalMalloc functions perform two distinct roles:
allocating a block of physical memory and mapping a portion of the (previously
reserved) global virtual address space to that physical memory. The
GlobalMallocToNode function associates allocated data with a specific virtual PIM home
node. However, if the function is invoked by code on a given PIM node, the
initial physical memory allocation is made on that PIM node, which need not
be the home node. A virtual-memory allocation request is sent to the remote
home node, which records the location of that data object and returns a range
of allocated virtual addresses drawn from its virtual pool. The requestor node
maps that virtual address range to the physical memory it has allocated from
its own physical pool. The requestor process is then free to access its
instantiated data object at will. The process may terminate its use of the data object
by invoking either the GlobalUnMap or GlobalFree function. Calling GlobalFree
unmaps the object and indicates that its physical storage may be recycled and
its content destroyed. Calling GlobalUnMap merely unmaps the object from the
current process and indicates that its content should persist. The object will be
relocated to its home node, where other processes may subsequently access it by
calling the GlobalMap function. The object will be destroyed when some process,
host or PIM, calls GlobalFree on it, or the computation terminates.
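
The life cycle of a shared block described above can be summarized as a small state model. The class and method names are hypothetical, and the rules it enforces are assumptions drawn from the prose: a single copy, at most one mapper at a time, and an error raised where a real system would block.

```python
class SharedObject:
    """Toy model of one shared data block and its single copy."""

    def __init__(self, home_node, creator_node):
        self.home = home_node
        self.location = creator_node   # physical storage starts on the creator
        self.mapped_by = creator_node  # only one node maps the block at a time
        self.alive = True

    def global_unmap(self, node):
        """Models GlobalUnMap: content persists and returns to the home node."""
        if not self.alive or self.mapped_by != node:
            raise ValueError("node does not hold a live mapping")
        self.mapped_by = None
        self.location = self.home

    def global_map(self, node):
        """Models GlobalMap: take temporary ownership of the block."""
        if not self.alive:
            raise ValueError("object was freed")
        if self.mapped_by is not None:
            raise RuntimeError("already mapped; a real system would block here")
        self.mapped_by = node
        self.location = node

    def global_free(self, node):
        """Models GlobalFree: storage may be recycled, content is destroyed."""
        if not self.alive:
            raise ValueError("object was already freed")
        if self.mapped_by not in (None, node):
            raise ValueError("object is mapped by another node")
        self.mapped_by = None
        self.alive = False
```

A typical sequence is creation on one node, GlobalUnMap to send the block home, GlobalMap from a second node, and a final GlobalFree.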

For simplicity, our sharing model supports only a single copy of the data and
will block to enforce serialized access if necessary. All access control is
serialized through the home node. In such a basic environment, careless use of the
GlobalMap function can result in deadlock; this is regarded as a programming
error.

The distributed-shared-memory mechanism outlined above is intended for
simple block-oriented data sharing, for applications where bandwidth is a more
appropriate metric than latency. More flexible and finer-grained access is
available via the parcel mechanism, which may be invoked either explicitly with
user-mode access to the network interface, or implicitly, by the PIM kernel in
response to an access fault. Note that our remote-access model permits access to
portions of existing global segments which are not mapped to physical memory
on the local PIM node, at higher cost.

8 Summary and Conclusion

This paper has described the memory management requirements for DIVA, a
PIM-based architecture incorporating PIMs as the only memory for a
conventional host processor. Two goals of the DIVA project impose fundamentally new
requirements on memory management: DIVA PIMs must perform pointer
accesses within memory, and they must support both smart-memory functionality
as well as conventional memory accesses. The adoption of a globally shared
address space for both host and PIM nodes allows free use of pointer-based data
structures. Careful partitioning of complex memory-management tasks such as
paging and swapping between the host and PIM node kernel allows a single host
processor to supervise many PIM nodes without overload.

Acknowledgments. The authors wish to thank the DIVA research group at Uni-

Abstract

Processing-in-memory (PIM) chips that integrate processor logic into memory devices offer a new opportunity
for bridging the growing gap between processor and memory speeds, especially for applications with high
memory-bandwidth requirements. The Data-IntensiVe Architecture (DIVA) system combines PIM memories
with one or more external host processors and a PIM-to-PIM interconnect. DIVA increases memory bandwidth
through two mechanisms: (1) performing selected computation in memory, reducing the quantity of data
transferred across the processor-memory interface; and (2) providing communication mechanisms called parcels for
moving both data and computation throughout memory, further bypassing the processor-memory bus. DIVA
uniquely supports acceleration of important irregular applications, including sparse-matrix and pointer-based
computations. In this paper, we focus on several aspects of DIVA designed to effectively support such
computations at very high performance levels: (1) the memory model and parcel definitions; (2) the PIM-to-PIM
interconnect; and, (3) requirements for the processor-to-memory interface. We demonstrate the potential of PIM-
based architectures in accelerating the performance of three irregular computations, sparse conjugate gradient,
a natural-join database operation and an object-oriented database query.

1.0  Introduction 

The increasing gap between processor and memory speeds is a well-known problem in computer architecture,
with peak processor performance increasing at a rate of 60% per year while memory access times improve at
merely 7%. To mask memory latency in current high-end computers now demands up to 25 times the number
of overlapped operations required of supercomputers 30 years ago. Further, techniques designed to hide
memory latency, such as multithreading and prefetching, actually increase the memory bandwidth requirements
[Burger96]. Recent VLSI technology trends offer a promising solution to bridging the processor-memory gap:
integrating processor logic and memory in a processing-in-memory (PIM) chip. Because PIM internal
processors can be directly connected to the memory banks, the memory bandwidth is dramatically increased (up to 2
orders of magnitude, tens or even hundreds of gigabits aggregate bandwidth on a chip). Latency to on-chip
logic is also reduced, down to as little as one-fourth that of a conventional memory system, because internal
memory accesses avoid the delays associated with communicating off chip.
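
To put the quoted growth rates in perspective, the short calculation below compounds them. It is a back-of-the-envelope sketch that assumes both rates stay constant and treats the 7% figure as a yearly speed-up of memory; neither assumption is made explicitly in the text.

```python
import math

CPU_GROWTH = 0.60  # processor performance growth per year, from the text
MEM_GROWTH = 0.07  # memory access-time improvement per year, from the text


def gap_after(years):
    """Factor by which the processor-memory speed ratio grows in `years`."""
    return ((1 + CPU_GROWTH) / (1 + MEM_GROWTH)) ** years


def years_to_double():
    """Years for the processor-memory speed ratio to double."""
    return math.log(2) / math.log((1 + CPU_GROWTH) / (1 + MEM_GROWTH))
```

Under these assumptions the ratio grows by roughly a factor of 1.5 each year, so the gap doubles in less than two years.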

The Data-IntensiVe Architecture (DIVA) project is developing a system, from VLSI design through system
architecture, systems software, compilers and applications, to take advantage of this technology for
applications of growing importance to the high-performance computing community. DIVA combines PIM memory
chips with one or more external host processors and a PIM-to-PIM interconnect (see Figure 1). Within a single
PIM chip, we observe dramatic improvements in bandwidth and significant reductions in memory latency.
But a more important effect, and a distinguishing feature of DIVA, is the coupling of increased opportunity for
concurrency with aggregate processor-memory bandwidth increases. Multiple memory chips can work in
parallel on independent data, and perform PIM-to-PIM communication without going through the processor-
memory bus.

An obvious class of applications well-suited to PIM technology is regular: dense-matrix computations on
large amounts of data that are "embarrassingly parallel," such as image processing. While good candidates for
DIVA, such applications also perform well on conventional systems. In this domain, locality-exploiting
architecture features (such as long cache lines and vector units) and compiler optimizations (such as tiling
[Wolfe89]), and techniques for hiding latency (such as prefetching [Mowry92]) are effective because such
applications exhibit significant data reuse, and compilers are able to predict their memory access requirements.


Figure 1: DIVA System Organization.


This paper argues the effectiveness of DIVA for a completely different class of applications: irregular, sparse-
matrix and pointer-based computations with high processor-memory bandwidth requirements (e.g., sparse
conjugate gradient and database applications). Such applications perform poorly on conventional architectures
because their control and data accesses cannot be statically predicted, and they do not make effective use of
cache. As a result, their execution is dominated by waiting for memory accesses [Carter99]. DIVA can
accelerate the performance of such applications by eliminating much of the memory traffic: simple operations and
dereferencing can be done in situ rather than laboriously moving data around the system. In addition to the
reduction in memory latency for each access, there is potential for coarse-grain parallelism across multiple
PIM chips. Performance improvements also result from secondary effects such as reduced host cache and TLB
pollution because irregular accesses no longer need be brought into the host processor cache.

While several PIM-based architectures have been proposed in recent years, the DIVA project differs from other
efforts in several ways. There are two distinct advantages to using PIMs as smart-memory coprocessors to one
or more external hosts: (1) DIVA permits augmenting conventional systems in general-purpose computing
environments; and, (2) applications can be gradually migrated from sequential versions that use DIVA PIMs as
"dumb" memory toward fully exploiting smart-memory capabilities and parallel in-memory execution. At the
same time, this co-processor model imposes fundamentally new requirements on the system software and
interfaces. Supporting in-memory pointer accesses requires a new memory model, including a mechanism for
address translation within memory. We also rely on the parcel, a mechanism for communicating computation
to memory, either from a host or a PIM processor. DIVA also requires the host-to-memory interface be
augmented because memory must now be able to communicate with the processor for synchronization, exceptions,
to warn of high-latency events, etc.

The primary contributions of this paper are as follows:

•  the  first  description  of  the  DIVA  architecture. 

•  the  first  presentation  of  system  requirements  for  in-memory  processing  of  irregular  data  structures. 

• a detailed description of how to map applications to a PIM-based architecture, with two case studies
from important irregular computations.

The remainder of the paper is organized into five main sections and a conclusion. The next section discusses
background and previous work. Section 3 presents the system architecture, particularly the PIM-to-PIM
interconnect. Section 4 discusses the requirements imposed on the system software and interfaces. Section 5
presents the DIVA memory model. In Section 6, we describe how a user application can be developed for DIVA,
leveraging existing approaches from parallel programming. Section 7 presents three case studies of irregular
computations from scientific and database computations; we present system-level simulation results to
demonstrate the potential of PIM-based systems at achieving improved performance on these applications.

2.0 Background and Related Work

The concept of mixing memory and logic closer than in a CPU-Memory dichotomy is an old one. The DAP,
STARAN, CM-2, and GAPP all used many relatively small data flows positioned very close to memory arrays
to implement very large SIMD machines (all with multiple data flows per chip). At least one such chip, the
TERASYS [Gokhale95], was fabricated in relatively large volumes, and targeted as the main memory for one
of the later Cray machines. This grew into more or less single chip systems which contained a CPU, some
memory, and I/O with machines like the INMOS Transputer [Knowles91], the nCUBE [Palmer86], the
J-machine [Dally92], and the SHARC (www.analogdevices.com). While these latter chips could scale to large
arrays, their system architecture was a relatively conventional MPP of some form. The first DRAM-based
multiple node PIM chip was EXECUBE, fabricated in 1992 and supporting a 3D binary hypercube MIMD/SIMD
MPP on a single chip [Kogge94][Sunaga96]. A more recent chip is the Mitsubishi M32R/D, where more than
2 MB of memory is tightly tied into the on-chip CPU's cache [Shimizu96].

What stopped all these designs from becoming mainstream architectures is very simple: memory density. Early
PIM-like devices used SRAM for memory, and even with relatively primitive MOS technology, it was quite
easy to put more processing power on a single chip than the on-chip data storage could feed. A rule of thumb
for scientific computing is that one byte of storage for each FLOP provides a good system balance. Taking any
of the previously discussed machines and computing the ratio of on-chip memory to performance (using
whatever metric of performance the chip was designed for, usually not even floating point), the ratios are
uniformly 0.0001 or worse. Even the EXECUBE chip had a storage to performance ratio of only 0.01. The chips
were uniformly memory starved, requiring designs which included ports to off-chip memory.

This began to change around 1997, when DRAM chips with densities greater than 32 Mbits began to appear.
At this density, a reasonable ratio of storage to processing can be achieved; for example, an entire video frame
buffer can fit in one chip, along with logic to perform processing on it. With current CMOS projections, in a
few years a single memory chip will contain more than enough memory capacity for a conventional PC. The
realization that complete systems can now be placed on a single chip has led virtually every major
semiconductor manufacturer to offer some form of an embedded DRAM macro that can be coupled with other
predefined logic macros. At least one industrial organization has sprung up to help set standards to enable
such systems [Birnbaum99].

While the technology has finally developed to the point of reasonable systems, architectures which take
distinct advantage of the new capabilities have only recently come under serious study. In addition to the
Mitsubishi M32R/D, the IRAM is another system-on-a-chip embedded DRAM device with vector processing logic,
designed for streaming computations [Patterson97]. Other approaches use PIM devices as the only processors
in a multiprocessor architecture: a cache-coherent distributed-shared-memory system [Saulsbury96], and a
large-scale distributed-memory system [Kogge96]. The Active Pages project, which is the most closely related
to DIVA, associates configurable logic with each memory page to accelerate performance of an external host
[Oskin98].

There are also several other architecture approaches, not based on PIM technology, designed to improve
processor-memory bandwidth [Carter99][Burger97][Rixner98]. Impulse augments the memory system to perform
application-specified scatter/gather operations on irregular data in the memory controller, so that contiguous
data is brought into the cache [Carter99]. Imagine is a system-on-a-chip streaming architecture designed for
media applications, which uses a stream programming model [Rixner98]. The DataScalar architecture is a
multiprocessor system where each processor asynchronously executes the same code and broadcasts any local data
to the other processors [Burger97]. DIVA is distinguished from these approaches as it supports a wide variety
of parallel programming models; DIVA PIMs, with the appropriate interconnect, can be used in a scalable
system with an unlimited number of chips, not just single chip solutions.

The DIVA architecture and the material presented in this paper are distinguished from these previous approaches
in several ways: (1) unlike most of these other approaches, we consider an architecture where smart memory is
optionally used to improve performance of a standard host processor; (2) we develop a system that can support
in-memory manipulation of both regular and irregular data structures; and, (3) we consider the requirements
imposed on the system architecture and system software for mapping application execution between host and
memory.


3.0 Overview of DIVA System Architecture

In Figure 1, we show a small set of PIMs connected to a single external host through a host-memory interface;
through this interface the host processor performs standard reads and writes, augmented as discussed in Section
4.2. The PIM chips communicate through separate PIM-to-PIM channels so that the additional memory traffic
from parcels used to spawn computation, gather results, synchronize activity, or simply access non-local data
bypasses the system bus. The separate interconnect is provided because PIM-to-PIM communication requires
greater bandwidth than can be achieved with a conventional memory bus.

3.1  PIM  Component 

A PIM is a VLSI memory device augmented with general and special-purpose computing hardware. A PIM
may consist of multiple nodes, each of which comprises a few megabytes of memory and a node processor.
The inset in Figure 1 shows a PIM with four nodes. The nodes on a chip share resources for communication
with the rest of the system. As a result each chip contains a single PIM Routing Co-processor (PiRC) and
a host interface. We anticipate that DIVA PIMs, like many other PIM chips, will be split roughly 60% memory
and 40% logic (reflecting the importance of memory density).

Within a single node, shown in Figure 2, the processing logic consists of a standard scalar microprocessor
including a floating-point unit and a special DIVA functional unit called an At-the-Sense-Amps Processor
(ASAP). The key idea behind the ASAP is to perform wide operations on aggregate objects stored within a row
of the local memory array. Rather than selecting a 32-bit object from the row as is done with conventional
scalar processing, the ASAP unit operates on up to 256 bits in a single processor cycle. This fine-grain
parallelism offers additional opportunity for exploiting the increased processor-memory bandwidth available
in a PIM. The ASAP unit can be used to perform bit-level operations such as simple pattern matching, or
higher-order computations such as searches, limited pointer chasing, and associative and commutative
reduction operations. Details on a related wide-word unit are discussed elsewhere [Brockman99].
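
The contrast between scalar and wide-word processing can be illustrated with a small software model. The sketch below is only an illustration and not the ASAP design: the function names, the use of 32-bit fields for pattern matching, and the step counts (one step per operation) are assumptions introduced here. Only the 256-bit row width and the 32-bit scalar width come from the text.

```python
from __future__ import annotations

ROW_BITS = 256      # bits handled per cycle by the wide-word unit (from the text)
FIELD_BITS = 32     # size of a conventional scalar object (from the text)
FIELDS = ROW_BITS // FIELD_BITS


def split_fields(row: int) -> list[int]:
    """Split a 256-bit row (held in a Python int) into eight 32-bit fields."""
    mask = (1 << FIELD_BITS) - 1
    return [(row >> (i * FIELD_BITS)) & mask for i in range(FIELDS)]


def scalar_match(row: int, pattern: int) -> tuple[list[int], int]:
    """Compare one field per step; return (matching field indices, steps)."""
    hits, steps = [], 0
    for i, value in enumerate(split_fields(row)):
        steps += 1
        if value == pattern:
            hits.append(i)
    return hits, steps


def wide_match(row: int, pattern: int) -> tuple[list[int], int]:
    """Model a wide-word compare: every field is examined in a single step."""
    replicated = sum(pattern << (i * FIELD_BITS) for i in range(FIELDS))
    diff = row ^ replicated          # a field is zero where it equals the pattern
    hits = [i for i, value in enumerate(split_fields(diff)) if value == 0]
    return hits, 1
```

Both functions return the same matches; the modeled step count drops from eight to one because the whole row is processed at once.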
3.2 PIM Interconnection

We anticipate PIM chips to be physically grouped as conventional memory chips, mounted on DIMM modules,
as shown in Figure 3. Bounded by host bus loading constraints, the number of PIM chips in a hosted cluster is
in the range of 32 to 64 chips, depending on how many PIM chips can be packed onto a DIMM module. The
PIM-to-PIM interconnect must then be amenable to the dense packing requirement of DIMM modules.
Obviously, low latency and high bandwidth are also desirable properties of this interconnect. Furthermore, this
network must be scalable to allow the addition or removal of modules from the system. This combination of
requirements favors a one-dimensional network. Although higher-dimension networks offer lower network
diameters, they are not easily scalable in all dimensions, especially in a densely packaged system. Also, the
dense packing achievable with one-dimensional networks allows more data signals per channel. Hence, the
slightly larger distances (in hops) of message traversals in a 32- or 64-hop one-dimensional network are
compensated by shorter messages (in flits). Furthermore, router cycle times are faster in one-dimensional
network routers because of simpler switching decisions.

The PIM interconnect requirements closely resemble those of interconnect in embedded scalable systems. We
therefore use the interconnection network of one such system, the Package-Driven Scalable System (PDSS)
[Steele97], as a model for designing the DIVA PIM interconnect. The DIVA PIM interconnect is then a
point-to-point bidirectional ring using wormhole routing and the Red Rover routing algorithm [Draper96] to effect
deadlock-free routing. It routes fixed-sized packets and uses source routing to achieve low latency. The
interconnect is implemented by PIM Routing Co-processor (PiRC) devices, one per PIM chip.
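
With source routing on a bidirectional ring, the sender can decide up front which way around the ring a packet should travel. The sketch below shows only that shortest-direction choice; it is not the Red Rover algorithm and does not model wormhole routing or deadlock avoidance. The node numbering (0 to n-1), the direction labels and the tie-breaking rule are assumptions made for illustration.

```python
from __future__ import annotations


def ring_route(src: int, dst: int, n: int) -> tuple[str, int]:
    """Return (direction, hops) for the shorter way around an n-node ring.

    "cw" means increasing node ids, "ccw" decreasing ids. Ties go to "cw";
    that choice is arbitrary in this sketch.
    """
    if not (0 <= src < n and 0 <= dst < n):
        raise ValueError("node id out of range")
    cw = (dst - src) % n       # hops when travelling toward higher ids
    ccw = (src - dst) % n      # hops when travelling toward lower ids
    return ("cw", cw) if cw <= ccw else ("ccw", ccw)


def ring_diameter(n: int) -> int:
    """Largest shortest-path hop count between any two nodes of the ring."""
    return n // 2
```

For example, in a 32-node ring a packet from node 0 to node 30 would be sent "ccw" over 2 hops rather than "cw" over 30.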

Later generations of DIVA systems are envisioned to contain hundreds and even thousands of PIM chips.
Clearly, the advantages of a flat ring topology do not extend to systems of this size. A more complex network
scheme will be needed. One possibility is another level of interconnect for connecting host PIM clusters. To
provide adequate aggregate bandwidth, this higher-level interconnect will have to employ channels with greater
bandwidth than those of the PIM chips. The details of these channels are beyond the scope of this paper.

[Figure labels: PIM-PIM Communication Channels; Off-Module Channel Connector]

Figure 3: PIM DIMM Module Organization.


4.0  Required  Mechanisms 

We  now  present  a  collection  of  key  mechanisms  in  DIVA. 

4.1  Parcels 

A parcel is the general mechanism for coordinating computation in memory, communicating data and
performing synchronization across components of the DIVA system, a refinement of the parcel concept described
previously [Brockman99]. Similar to an active message [vonEicken92], a parcel incorporates data and an encoded
operation to apply to the data; a parcel is directed to a memory object, not a process or processor. A parcel has
the following four fields:

• pid: indicates which process issued the parcel.

• object: the virtual address of the primary object the parcel will modify or access, used for routing the
parcel.

• command: an integer encoding the action to be performed, which may refer to a compiled function
stored on the PIM.

• arguments: (other than object), specified as virtual addresses.
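
The four fields can be pictured as a small record, with the command acting as an index into a table of functions held on the PIM. The following is a minimal sketch under stated assumptions: the class name, the method table, the two example methods and the use of a dictionary as "memory" are all hypothetical and are not taken from the DIVA design.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Parcel:
    pid: int                          # process that issued the parcel
    obj: int                          # virtual address of the primary object (routing key)
    command: int                      # integer code selecting the action to perform
    arguments: tuple[int, ...] = ()   # further virtual addresses, other than obj


METHODS = {}   # hypothetical method table: command code -> function on the PIM


def method(code: int):
    """Register a function under a command code."""
    def register(fn):
        METHODS[code] = fn
        return fn
    return register


@method(1)
def increment(memory: dict, parcel: Parcel) -> None:
    memory[parcel.obj] += 1


@method(2)
def add_from(memory: dict, parcel: Parcel) -> None:
    # Add the value stored at the first argument address into the target object.
    memory[parcel.obj] += memory[parcel.arguments[0]]


def execute(memory: dict, parcel: Parcel) -> None:
    """Apply the parcel's encoded operation to the object it is directed to."""
    METHODS[parcel.command](memory, parcel)
```

Note that the parcel names a memory object rather than a processor; whichever node holds that object would run the selected method.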

An obvious requirement on parcels is small size, to prevent overloading the host-to-memory interface and
PIM-to-PIM interconnect. In DIVA, we expect a single packet to consist of a header and 256 bits of payload. A
parcel requiring more bits must be sent in multiple packets. A related requirement is that processing parcels
must be efficient (see 5.2.1).

In addition, protection must be provided on arguments, pid and command fields; the protection on memory
accesses cannot rely on standard host mechanisms as the parcels pass virtual rather than physical addresses to
the memory. Also, the order of parcel processing must preserve sequential semantics, but parcel execution
should be overlapped to exploit parallelism. To accomplish these goals, we employ optional sequence numbers
on parcels when a specific ordering of processing is required.
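
One way to picture optional sequence numbers is a receiver that lets unsequenced parcels proceed immediately while holding sequenced ones until their predecessors have arrived. This is a minimal sketch of that idea, not the DIVA mechanism: the class name, the single shared counter starting at zero, and the use of strings to stand in for parcels are assumptions.

```python
from __future__ import annotations


class OrderedReceiver:
    """Releases sequenced items in order; unsequenced items pass straight through."""

    def __init__(self) -> None:
        self.next_seq = 0     # next sequence number allowed to execute
        self.pending = {}     # sequenced items that arrived early
        self.executed = []    # order in which items were actually processed

    def receive(self, item: str, seq: int | None = None) -> None:
        if seq is None:
            # No ordering constraint: free to execute right away.
            self.executed.append(item)
            return
        self.pending[seq] = item
        # Drain every item whose predecessors have all been executed.
        while self.next_seq in self.pending:
            self.executed.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
```

A real system would likely keep a separate sequence per sender or per object; a single counter keeps the sketch short.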

4.2 Host-Memory Interface

In the initial DIVA prototype, an underlying assumption is that DIVA PIM devices can also serve as
conventional memory, so that they can be used as smart-memory coprocessors in a standard system. For this reason,
the PIM VLSI device is being designed with a host interface consistent with the standard memory interface
typical of commercial memories. This enables PIMs to be packaged in the form of DIMM modules with
provisions for top-plane interconnections to support the PIM-to-PIM communication fabric. However, unlike
commercial memories, computation activities give rise to new problems: how to communicate internal exceptions
and possible memory busy conditions to the host system. These issues are being addressed as part of the larger
system architecture.

5.0  Memory  Model 

Systems  with  smart  memory  resemble  both  uniprocessors  (or  small  SMPs)  with  large  memory,  and  large,  het¬ 
erogeneous  multiprocessors.  The  semantics  are  made  precise  by  the  DIVA  memory  model,  developed  from  the 
following  list  of  requirements: 

• a simple virtual machine for both programmers and compiler writers;

•  application-level  visibility  and  control  of  data  placement; 

•  high  overall  performance; 

•  scalability  to  many  PIM  chips,  larger  PIM  chips,  and  multiprocessor  hosts; 

• compatibility with conventional memory models and memory interfaces;

• support for virtual memory (i.e., paging to/from disk); and,

•  a  host-independent  PIM  chip  architecture. 

These requirements look ahead toward future uses of PIM chips, augmenting all sorts of systems and used to
accelerate all sorts of applications, both at the small and large scale.

5.1  Parcel  Buffers 

For high performance, applications must communicate with PIM chips without invoking the host operating
system. A conventional memory interface supports this naturally, but cannot generally guarantee atomicity or
ordering when caching and write buffers exist. Each PIM chip therefore has a second intelligent interface, the
Parcel Buffer, which is mapped into each process as a (roughly) parcel-sized piece of SRAM. The host OS
ensures each process uses a different physical address for the multiply-mapped buffer, so the interface can
identify the source of each transaction. Hardware in the interface transparently manages ownership of the
buffer using a wait-free protocol [Herlihy91] that can be implemented simply at the application level without
supervisor state interactions; this interface hardware ensures that access patterns are grammatically correct. To
communicate a parcel, a process reads or writes fields in the buffer, then performs a final read on a status field
to pass the parcel to the PIM chip internals. In the rare case of corrupted accesses, a failure status is returned,
and the application can retry.
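
The write-fields, read-status, retry pattern can be mimicked in a few lines of software. The model below is a deliberately simplified sketch and not the hardware protocol: the class and function names, the rule that a new source simply takes over the buffer, and the use of string source identifiers in place of distinct physical addresses are all assumptions introduced for illustration.

```python
from __future__ import annotations


class ParcelBuffer:
    """Toy model of a shared parcel buffer with ownership tracking."""

    FIELDS = ("pid", "object", "command", "arguments")

    def __init__(self) -> None:
        self.owner = None      # source currently assembling a parcel
        self.fields = {}       # fields written so far by that source
        self.delivered = []    # parcels passed on to the PIM internals

    def write(self, source: str, name: str, value) -> None:
        if source != self.owner:
            # A different source takes over; the previous partial parcel is lost.
            self.owner, self.fields = source, {}
        self.fields[name] = value

    def read_status(self, source: str) -> bool:
        """Final read: deliver the parcel if the access pattern was complete."""
        ok = source == self.owner and all(f in self.fields for f in self.FIELDS)
        if ok:
            self.delivered.append(dict(self.fields))
        if source == self.owner:
            self.owner, self.fields = None, {}
        return ok


def send(buf: ParcelBuffer, source: str, parcel: dict, max_tries: int = 3) -> int:
    """Write all fields, then read status; retry on failure. Returns tries used."""
    for attempt in range(1, max_tries + 1):
        for name in ParcelBuffer.FIELDS:
            buf.write(source, name, parcel[name])
        if buf.read_status(source):
            return attempt
    raise RuntimeError("parcel could not be delivered")
```

If another process writes to the buffer partway through, the first process sees a failure status on its final read and simply repeats the sequence, with no operating-system involvement.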

5.2 Address Translation

Parcels, application code and data contain virtual addresses. For PIM processors to interpret these, they must
have access to translation information, or there must be some fixed relationship between virtual and physical
addresses. The latter option is simpler to implement, but was determined to be too restrictive. Each PIM thus
contains translation hardware, and tables managed by the host. Any virtual page can reside on any PIM.
However, the hardware is simplified by the characteristics of the system. For instance, for performance, a PIM
needs to be able to rapidly determine if an address is local to its own memory bank, and find the physical
address if it is. However, if the address is not local and communication is required, the additional cost of the
non-local translation is negligible.

Each PIM therefore maintains translations for those virtual pages currently residing on it, plus part of a global,
distributed table (similar to a home node concept as presented in [Saulsbury95]). Non-local translations are
obtained by querying the distributed table, or, equivalently, submitting the virtual address in a parcel, for
forwarding to the PIM where it resides. Advantages of this approach are that the translation tables on each PIM
scale well; every address can be accessed in at most two parcel transmissions, and the application can
optionally maintain location hints and use them to reduce this to a single parcel transmission in
performance-critical cases.
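
The two-transmission bound and the effect of a location hint can be seen in a small model. The sketch below is illustrative only: the class and method names, the 4 KiB page size, and the rule that the home node of a page is its virtual page number modulo the node count are assumptions, not details from the text.

```python
from __future__ import annotations

PAGE_BITS = 12   # assumed page size of 4 KiB


class PimSystem:
    def __init__(self, num_nodes: int) -> None:
        self.n = num_nodes
        # Per node: translations for pages resident on that node (vpage -> ppage).
        self.local = [dict() for _ in range(num_nodes)]
        # Per node: its slice of the distributed table (vpage -> resident node).
        self.directory = [dict() for _ in range(num_nodes)]

    def home(self, vpage: int) -> int:
        return vpage % self.n   # assumed placement of distributed-table entries

    def map_page(self, vpage: int, node: int, ppage: int) -> None:
        self.local[node][vpage] = ppage
        self.directory[self.home(vpage)][vpage] = node

    def deliver(self, vaddr: int, hint: int | None = None) -> tuple[int, int, int]:
        """Route a parcel to vaddr; return (node, physical address, transmissions)."""
        vpage = vaddr >> PAGE_BITS
        offset = vaddr & ((1 << PAGE_BITS) - 1)
        node = self.home(vpage) if hint is None else hint
        sent = 1
        while vpage not in self.local[node]:
            if node == self.home(vpage):
                node = self.directory[node][vpage]   # KeyError if the page is unmapped
            else:
                node = self.home(vpage)              # wrong hint: fall back to the home node
            sent += 1
        return node, (self.local[node][vpage] << PAGE_BITS) | offset, sent
```

Without a hint the parcel goes to the home node and is forwarded at most once; with a correct hint it reaches the resident node directly. In this sketch a stale hint costs extra forwarding, a case the text does not discuss.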

5.3 PIM Memory Organization

The DRAM in the PIM subsystem is the primary storage for the DIVA system, and can be treated physically as
a uniform, undifferentiated RAM. However, during operation the system uses the memory in three distinct
ways, making it helpful to organize the memory on each PIM node logically into three regions according to
whether it is used primarily by the host processor, primarily by the PIM processor, or significantly by both.
These regions may be either physically contiguous or interspersed, and memory allocation within these regions
can either be initiated by explicit system calls in the application, or undertaken at load time for all applications
by the loader or start-up code. A flexible combination of static and dynamic allocation is usually most
convenient for the user, but for this discussion assume explicit system calls are used.

An advantage of making this distinction is that different, optimized memory-management hardware can be
used on each of the regions. As modern processor architectures demonstrate [IBMMot99], there is no
conceptual problem with having multiple translation mechanisms in place, as long as they provide consistent
virtual-to-physical mappings and access permissions.

Dumb Memory: Initially, the application is a normal (say Unix) process on the host. The various regions of its
virtual address space (typically the user code, heap and stack and one or more kernel segments) are mapped as
usual to some set of pages in DRAM, with some possibly paged out to disk. If the system memory contains
both ordinary DRAM and PIM DRAM, these normal pages can be mapped into the ordinary DRAM, since
they are never directly accessed by PIM processors. If all memory is PIM memory, the system can simply note
that these pages are only accessed by the host, and that they need not appear in PIM-processor translation
tables. A major use of dumb memory will be application code for the host CPU, which is meaningless to the
PIM processors; also, many host processes will never require PIM services at all, and will remain in this
configuration.

Internal Memory: If an application elects to use the PIM processing, the first step is to allocate and initialize
a region of memory on each node to be used by that node for its local processing needs. These include: a small
run-time kernel for parcel management, synchronization and exception handling; code for the application-level
methods supported by the PIM; and storage for executing PIM programs such as buffers and stacks.

In practice, efficiency dictates whether this initialization step occurs at host boot time, application load time,
during application start-up, following an explicit system call, or transparently when the first PIM operation is
attempted; some combination of these initialization steps can be profitably supported. For instance, a basic
PIM kernel could be installed on each node at system boot time, as could code for any widely useful PIM
methods. Application load time is a good time to install application-specific method code that is used
frequently; individual methods from a large system library could be loaded dynamically on demand during
application execution.

User-level code on the host never accesses this internal memory during normal operation. To the host, the
internal pages appear within the supervisor region, like the kernel and its associated data structures. Moreover, the
host only needs to access them under exceptional conditions, e.g., application loads, service requests, and
errors. Access from the host is thus guaranteed to be infrequent, through trusted code with access to translation
tables. On the other hand, access to internal regions by the PIM processor needs to be highly efficient and well
protected, since it is used for everything from local OS code and data to execution stacks and working memory
for the many light-weight user-level methods launched in response to parcels during normal operation.

One can exploit these asymmetric requirements by adopting a memory-management approach for the internal
memory that is very convenient for the PIM processor, but perhaps quite unrelated to the memory-management
hardware on the host. A particularly useful scheme, planned for the prototype, is to give each lightweight local
context on a PIM processor eight variable-sized segments or pages of internal memory, each defined by virtual
and physical base addresses, size and access permissions. By convention, these are assigned to the following:

1.  Supervisor-level  kernel  code  (shared  by  all  contexts  on  the  node) 

2.  Supervisor-level  kernel  data  and  stack  (shared  by  all  contexts  on  the  node) 

3. User-level code (shared by all contexts in the same application)

4.  User-level  data  (shared  by  all  contexts  in  the  same  application) 

5.  User  stack  (unique  to  each  context) 

6.  Miscellaneous  (possibly  unique  to  each  context) 

7. Supervisor-level parcel buffer device (shared by all contexts on the node)

8.  User-level  parcel  buffer  device  (shared  by  all  contexts  in  the  same  application). 

Translation of internal virtual addresses can be made extremely fast and efficient by adopting some simple
conventions, e.g., high bits of all the page virtual starting addresses are the same, the next three bits specify the
page number, and the size is a power of two. Then, the TLB simplifies to a look-up table, the translation
information for a lightweight context fits into 256 bits, and can be switched in one clock cycle. Since PIM nodes do
not access each other's internal memory, the same virtual address range can be used for internal memory on
every node, making PIM contexts relocatable from one node to another.
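
Under these conventions, translating an internal address reduces to a prefix check, a three-bit table index and a bounds check. The sketch below models that look-up in software; the concrete bit positions (a 24-bit offset field and the value of the common prefix), the record layout and the function name are assumptions for illustration, and the sketch does not attempt to pack a context into 256 bits.

```python
from __future__ import annotations

from dataclasses import dataclass

OFFSET_BITS = 24              # assumed: up to 16 MiB per segment
INTERNAL_PREFIX = 0b11110     # assumed common high bits of internal addresses


@dataclass(frozen=True)
class Segment:
    phys_base: int            # physical base address
    log2_size: int            # segment size is a power of two
    writable: bool            # simplified access permission


def translate(table: list[Segment | None], vaddr: int, write: bool = False) -> int:
    """Translate an internal virtual address using an eight-entry look-up table."""
    if vaddr >> (OFFSET_BITS + 3) != INTERNAL_PREFIX:
        raise ValueError("not an internal address")
    seg = table[(vaddr >> OFFSET_BITS) & 0b111]   # three bits select the segment
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    if seg is None or offset >> seg.log2_size:    # unmapped, or beyond the segment size
        raise MemoryError("segment fault")
    if write and not seg.writable:
        raise PermissionError("read-only segment")
    return seg.phys_base + offset
```

Switching contexts in this model amounts to swapping the eight-entry table, which is why a fixed, small table suits lightweight contexts.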

Global Memory: The next step in setting up to use the PIM features is to allocate DRAM on each PIM for use
as smart "global" storage. This can be done at run time by a series of system calls such as
mem_alloc(pim_node, virtual_address, size), which allocates a region of memory of size
bytes on pim_node and maps it at virtual_address, an unmapped virtual address range within the
application address space. Unlike the dumb memory, whose mapping is visible only to the host process, or the
internal memory, whose mapping is visible only to the associated PIM node and the host OS, the global
memory is visible to the host process and to all PIM nodes involved in the application. Although only the host and
the local node where the data resides can access an element of global memory directly (i.e., by read and write
instructions), pointers to global objects are meaningful to all nodes and to the host, and can be communicated
freely within parcels. Once global memory has been allocated, the host process can set up any initialized data
by writing to it. In practice, global memory will make up the majority of memory in a data-intensive
application using the PIM features. It is important that access to this memory be efficient from both the PIM and the
host, and this therefore presents the greatest implementation challenge. Address translation must be compatible
with both host CPU and PIM hardware. The page size must therefore be equal to or a multiple of the
hardware-supported page size of the host CPU. Also, each PIM node should ideally be able to hold in a fast TLB the
translation information for all active global pages resident on it, to avoid the frequent TLB misses that would
occur on an irregular application. This therefore suggests that global memory pages should be large; in the
prototype, one simplifying option under consideration is a single very large global page (per application) on each
node.


Parcel Buffers: The final step in invoking the PIMs is to request the host OS to allocate and map one or more
virtual parcel buffers, for use in communicating parcels with the PIM system. Parcels are then sent to
individual nodes to start the PIM computation. Finally, when the computation is complete, one of the PIM methods
communicates this to the host, typically by setting a flag in the global memory, and the host picks up the results
from the global memory.


The overall memory structure for a typical DIVA application is shown in Figure 4. In the far left column is the
virtual address space of a typical host application process, where each rectangle represents a page or segment
from one of the memory regions. Shaded pages are accessible only while in supervisor mode. The label
indicates whether the page is used for global data (G), dumb user or system pages (DU or DS) or internal user or
system pages (IU or IS), as well as the PIM node (0-3) where the page currently resides. The second column
shows the subset of pages visible to a method executing on PIM node 0. The third column shows the subset of
pages actually resident on node 0. The last two columns show the same information for node 1. Note that
global pages are visible from all nodes, while internal pages are visible only on their local nodes, where they
appear at a common virtual address.

Figure 4: Decomposition of a host process address space across multiple
PIM nodes. Shaded regions have supervisor-level protection.

5.4 Coherence Management

In any system with distributed processing, the distributed information needs to be kept coherent, and a
consistent model of memory access must be chosen and maintained. Conventional NUMA and COMA models are
suboptimal for irregular, data-intensive applications. Specifically, in a NUMA or COMA model, a reference to
remote data by a local node causes the remote data to be automatically moved or copied to the local node,
where it is made available under the same virtual address as the remote version. In general, the overhead of
supporting this model becomes excessive for irregular applications, where there is by definition great potential
for false sharing, and little temporal locality.

The philosophy in the DIVA system is therefore to move the computation to the data, rather than move the data
to the computation. At any one time, the data at a virtual address is located on exactly one PIM node, and there
are no cached copies on other PIM nodes. Global pages can be moved from one node to another for load balancing
purposes, but this is a heavyweight operation that should be used infrequently and explicitly managed
by the operating system. Consistency of the distributed address translation table must be maintained, but since
this changes relatively rarely, software coherence methods are adequate.
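
The single-owner rule described above can be illustrated with a small sketch. It is written in Python purely for illustration and is not DIVA code: the names PageDirectory, owner_of, run_at_owner and migrate, and the 4 KB page size, are all assumptions introduced here.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


class PageDirectory:
    """Maps each virtual page to exactly one owning PIM node (no cached copies)."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.owner = {}                                   # virtual page number -> node id
        self.memory = [dict() for _ in range(num_nodes)]  # per-node storage: address -> value

    def map_page(self, vpage, node):
        self.owner[vpage] = node

    def owner_of(self, vaddr):
        return self.owner[vaddr // PAGE_SIZE]

    def run_at_owner(self, vaddr, method):
        """Ship the computation to the node that holds vaddr, not the data to the caller."""
        node = self.owner_of(vaddr)
        local = self.memory[node]
        local[vaddr] = method(local.get(vaddr, 0))
        return node

    def migrate(self, vpage, new_node):
        """Heavyweight, OS-managed move: copy the page's data, then update the table."""
        old_node = self.owner[vpage]
        lo, hi = vpage * PAGE_SIZE, (vpage + 1) * PAGE_SIZE
        moved = {a: v for a, v in self.memory[old_node].items() if lo <= a < hi}
        for a in moved:
            del self.memory[old_node][a]
        self.memory[new_node].update(moved)
        self.owner[vpage] = new_node


directory = PageDirectory(num_nodes=4)
directory.map_page(0, node=2)
ran_on = directory.run_at_owner(8, lambda x: x + 5)   # executes on node 2
directory.migrate(0, new_node=1)                       # rare, explicit rebalancing
```

Because a page has one owner at a time, the only shared state that needs to stay consistent in this sketch is the owner table itself, which changes only inside migrate.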

During normal operation, therefore, data coherence issues do not arise between PIMs, and there is no need for
a sophisticated, hardware-supported coherence mechanism. The movement of code is a much simpler problem,
since code is read-only, and can be replicated easily. Moreover, the only references to code that get passed in
parcels are indirect references that index into a method table, so the translation mechanism for code references
is built into the application. The result is a memory model that can be supported by fairly simple hardware in
the PIM nodes, independent of the host CPU details.

The remaining coherence issue, namely between the PIM system and the host, is the most difficult. Individual
cache lines may be cached by the host processor(s). The simplest solution, adopted in this prototype, is to
always explicitly flush PIM-accessible data, or keep it uncached. A more transparent approach is for each PIM
to track ownership of individual cache lines, and request writebacks from the CPU caches as necessary. The
hardware for this on each PIM is not excessive and scales well, so this is a suitable long-term solution. However,
broader issues suggest it is premature to implement in our prototype. As stated at the beginning of this
section, our goal is a memory model that is independent of which processor is used as a host; the mechanism
for requesting writebacks is processor specific, and usually involves the requestor driving the address bus. In a
large system with many potential requestors, this introduces significant arbitration, electrical drive, and portability
problems. In the long term, it would be better to develop a standard (probably network-based) memory-to-processor
channel for this activity, which would find other uses in smart memory systems.

Although the explicit flushing is a burden, either to the programmer or compiler, it is not expected to degrade
performance significantly. In practice, even with automated hardware, the user would probably obtain higher
performance in some applications by manually flushing cache lines anyhow, to minimize the number of writeback
requests.
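
The explicit-flush discipline can be made concrete with a toy model of a write-back host cache. This is a sketch under stated assumptions, not the prototype's mechanism: the class WriteBackCache and the helper pim_increment are hypothetical names, and a Python dictionary stands in for the memory shared by the host and the PIMs.

```python
class WriteBackCache:
    """Toy host cache: a line is either absent, clean, or dirty (modified)."""

    def __init__(self, memory):
        self.memory = memory   # backing store shared with the PIMs: address -> value
        self.lines = {}        # address -> (value, dirty flag)

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = (self.memory.get(addr, 0), False)
        return self.lines[addr][0]

    def write(self, addr, value):
        self.lines[addr] = (value, True)   # write-back: memory is not updated yet

    def flush(self, addr):
        """Invalidate the line; write it back first if it was modified."""
        if addr in self.lines:
            value, dirty = self.lines.pop(addr)
            if dirty:
                self.memory[addr] = value


def pim_increment(memory, addr):
    """Stand-in for a PIM method that updates memory directly, bypassing the host cache."""
    memory[addr] = memory.get(addr, 0) + 1


memory = {}
cache = WriteBackCache(memory)
cache.write(0x100, 41)
cache.flush(0x100)            # required before handing the datum to a PIM
pim_increment(memory, 0x100)
result = cache.read(0x100)    # line was invalidated, so this re-reads memory

# Without the flush, the PIM sees stale memory and the host keeps a stale line.
stale_memory = {}
stale_cache = WriteBackCache(stale_memory)
stale_cache.write(0x100, 41)
pim_increment(stale_memory, 0x100)
stale_result = stale_cache.read(0x100)
```

In the second half of the sketch the host and the PIM end up disagreeing about the same address, which is the failure the flush is meant to prevent.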

6.0 Developing Applications in DIVA

The success of a new architecture is highly dependent on the ease with which software can be developed for it. It
should be straightforward to develop correct programs, even if it is somewhat more difficult to effectively
exploit the performance-enhancing features of the architecture. DIVA offers a smooth migration path for developing
applications. First, the applications programmer can begin with a standard sequential program, which
will run correctly with no modification by using the PIMs as standard memory. Then, either the compiler or
programmer can exploit the PIMs as smart memory in portions of the application where this is deemed profitable,
gradually migrating the original sequential application to make full use of the DIVA architecture.

To the applications programmer or compiler, the abstract DIVA architecture appears very similar to a distributed-shared-memory
multiprocessor. The host can serve as a master to coordinate activities on the PIMs. Each
node on a PIM processor acts as a worker processor waiting for work, and possibly initiating work on other
PIMs through the parcel mechanism. The memory associated with a PIM node can be thought of as its local
memory. The PIM node can access a datum on other memory chips through a global address space without
needing to know its exact location. Coherence of data shared across PIM chips is not guaranteed by the hardware
and must be managed by either the compiler or programmer, similar to what is required in the Cray [...]. Also,
as with distributed-shared-memory multiprocessors, locality of data accesses is very important to good performance.

Because of these similarities to a distributed-shared-memory multiprocessor, most parallelizing and locality-management
compiler techniques and parallel programming paradigms can be leveraged for DIVA. Applicable
compilation techniques include automatic parallelization for both regular [Blume96] [Hall96] and irregular
applications [Rinard97], and data and computation co-location [Anderson95]. Explicitly parallel programming
languages that permit some programmer control of locality are also applicable, such as High Performance Fortran
and its extensions for irregular applications, Olden [Carlisle95], and CC++ [Foster95]. As discussed in
Section 4.1, the parcel mechanism is really a refinement, tailored to the DIVA architecture, of active messages,
which were developed for message-passing multiprocessor systems [vonEicken92].

While there are many similarities between programming for DIVA and parallel programming, there are several
important differences. One additional requirement is keeping the host cache coherent with the PIM memories.
As discussed in Section 5.4, this is accomplished with explicit flushing, immediately prior to sending a parcel
from the host, of objects in the host cache that may be touched by the PIM computation. In keeping with the
above stated goal of making correct programs easy to develop, the required flushing can be optionally automated
by the compiler through analysis of the object and arguments associated with the parcel. Further, DIVA
applications can exploit fine-grain parallelism using the ASAP functional unit for operations on aggregate data
objects, which demands a combination of compiler technology and a user development environment for
exploiting complex ASAP-oriented computations (e.g., string matching). Other high-level operations such as
memory management can be optimized for the PIMs to improve the locality of pointer-based computations. As
an example, when building a tree data structure in parallel, each PIM can locally allocate a subtree, with the
host sequentially connecting the subtrees in the upper level of the tree. Locality for each subtree is then
ensured.

An important component of the DIVA project is a large software effort to develop application programmer
libraries, and compiler and run-time system support. The DIVA compiler, either automatically or in response to
programmer specification, partitions computation and data across host and PIMs. This partitioning requires
that it generate code that interfaces with the operating system to control data placement on the PIMs, generate
code to load application-specific PIM code onto the memories, and also generate parcels in the appropriate
places in the code to initiate PIM computation, communicate and synchronize. This high-level code must
then pass through separate backend compilers: one for the host, for which we can use an existing native backend
compiler; and one for the PIMs, which requires a DIVA PIM-specific backend that generates standard
RISC as well as ASAP instructions. There are also separate run-time systems for the host and PIMs. The host
run-time system performs similar functions to a standard architecture-independent parallel run-time library
(e.g., Pthreads), managing threads and synchronization. The PIM run-time system is a small, DIVA-specific
system, primarily for parcel processing.

As part of the software development efforts, we are currently retargeting the Stanford SUIF compiler system to
DIVA, allowing us to take advantage of its wealth of compiler analyses for distributed-shared-memory
machines. In addition, we are developing an extensible approach to support compiler and programmer generation
of ASAP instructions that are seamlessly integrated into the PIM backend. Since DIVA is targeting irregular
computations, we are also investigating a memory management library for dynamic generation and
reorganization of irregular data structures.

7.0  Case  Studies 

To derive preliminary performance estimates for complete applications, we developed a simulator for the
major system components of DIVA. The simulated architecture consists of a host processor, and a number of
PIMs interconnected via a PiRC ring network. We simulate computations executing on the host and PIMs
using Shade [Cmelik94]. Shade executes application programs and generates traces under the control of a user-supplied
trace analyzer. We simulate parallel execution in our experiments by recording the simulated time at
the beginning of a parallel section and setting the parallel execution time at the end of the concurrent execution
to be the maximum value of the simulated time by each of the participating PIM nodes. The Shade-based simulator
does not directly model the PiRC interconnection. To account for network latency and congestion, we
generate traces of time-stamped network requests for each application, and use these traces as inputs to a network
simulator [Draper…]. The throughput and contention derived by the network simulator are then used as
parameters to the Shade simulator.
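
The rule for timing a parallel section can be written down in a few lines. This is a minimal sketch, assuming each node reports the simulated time it spent inside the section; the function name parallel_section_end and the sample numbers are illustrative and do not come from the simulator.

```python
def parallel_section_end(start_time, node_elapsed):
    """Simulated time at the end of a parallel section.

    start_time   -- simulated time recorded when the section begins
    node_elapsed -- simulated time each participating PIM node spends in the section
    """
    if not node_elapsed:
        return start_time
    # The section finishes when the slowest participating node finishes.
    return start_time + max(node_elapsed)


end_time = parallel_section_end(1000, [250, 400, 310, 390])
```

Taking the maximum rather than the sum is what models concurrency: the nodes run side by side, so only the slowest one determines when the host can continue.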

The PIM chips modeled in these experiments are much simpler than what was presented in Section 2. There is
a single node per chip, and we only consider applications that use standard scalar integer and floating-point
processing on the PIM nodes (i.e., no ASAP instructions). These simplifications reduce the contention for on-chip
resources, and allow us to get meaningful early results from the simple Shade-based simulation strategy.
We anticipate that the multiple processing nodes per PIM chip and the ASAP functional units planned for the
actual DIVA implementation will yield much better on-chip computation rates, albeit with additional costs due
to contention for internal memory banks and PiRC channels.

In our simulations, each PIM node consists of a PIM processor, a 2M-byte memory bank, a host interface and a
PiRC network interface. Since processor technology is optimized for speed and DRAM technology is optimized
for density and yield, the PIM processing logic is expected to be slower than the host processor logic.
Based on projections, we assume that the PIM processor cycle is twice the host processor cycle. The PIM node
memory bank is organized as 8192 2K-bit memory rows, and the DRAM interface provides a 256-bit subrow
per memory access. We assume the first access to a 2K-bit row (random-mode access) takes 2 PIM cycles, and
each subsequent access to the same row (page-mode access) takes 1 PIM cycle. These parameters are based on
current memory speeds [Kogge95].
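
The memory timing assumptions in this paragraph can be turned into a small cost model. The sketch below uses only the figures stated in the text (2K-bit rows, 256-bit subrows, 2 PIM cycles for a random-mode access, 1 PIM cycle for a page-mode access, a 2M-byte bank); the function access_cost and the assumption that exactly one row stays open between accesses are introduced here for illustration.

```python
ROW_BITS = 2048          # one memory row (2K bits)
SUBROW_BITS = 256        # bits delivered per access
RANDOM_MODE_CYCLES = 2   # first access to a row, in PIM cycles
PAGE_MODE_CYCLES = 1     # later access to the same open row, in PIM cycles
HOST_CYCLES_PER_PIM_CYCLE = 2

ROW_BYTES = ROW_BITS // 8


def access_cost(byte_addresses):
    """Total PIM cycles for a sequence of accesses, assuming one open row at a time."""
    open_row = None
    cycles = 0
    for addr in byte_addresses:
        row = addr // ROW_BYTES
        cycles += PAGE_MODE_CYCLES if row == open_row else RANDOM_MODE_CYCLES
        open_row = row
    return cycles


subrow_bytes = SUBROW_BITS // 8
sequential = [i * subrow_bytes for i in range(8)]   # the 8 subrows of row 0
scattered = [i * ROW_BYTES for i in range(8)]        # 8 different rows

rows_in_bank = (2 * 1024 * 1024 * 8) // ROW_BITS     # rows in a 2M-byte bank
sequential_host_cycles = access_cost(sequential) * HOST_CYCLES_PER_PIM_CYCLE
```

Under these assumptions, streaming through one row costs 9 PIM cycles for eight accesses, while touching eight different rows costs 16, which is the gap between regular and irregular access patterns that the model captures.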

The host has separate instruction and data on-chip caches, and a unified off-chip second level cache. We model
a parcel issue as a sequence of writes to specific memory addresses, the last of which triggers the delivery of
the parcel. Coherence between the caches and memory is enforced by software (e.g., the compiler), using an
instruction to flush data from the cache. At a flush instruction, the simulator invalidates the cache line and, if
the line is modified, writes it back to memory. We summarize the simulation parameters in Table 1. We now
present results on three applications evaluated with this simulation methodology.

Cache Parameter      Instruction L1    Data L1        Data L2

Host Caches
  size               32 K bytes        32 K bytes     1 M bytes
  associativity      2                 2              2
  line size          64 bytes          32 bytes       32 bytes
  replacement        LRU               LRU            LRU
  write policy       write back        write back     write back
  latency (hit)      1 cycle           1 cycle        10 cycles
  latency (miss)     10 cycles         10 cycles      100 cycles

PIM Node
  processor cycle    2 cycles (a)
  memory size        2 M bytes
  memory row size    256 bits
  memory latency     1 cycle (a) (page mode), 4 cycles (a) (random mode)

PiRC Network
  channel width      32 bits
  network cycle      4 cycles (a)

a. Host processor cycles

Table 1: Simulation Parameters used in Application Studies.

7.1 NAS Sparse Conjugate Gradient (CG)

CG implements a linear system solver using a conjugate gradient iterative method. Its main data structures are
three very large arrays of floating-point double-precision values. The main computation consists of a sparse
matrix-vector product (see Figure 5(a)) and accounts for about 80% of the total sequential execution time. The
computation is structured as a single loop performing commutative and associative updates to array Y indexed
by the values in the ROWIDX array. The sparseness of the computation is derived from the indirection of the
accesses to the Y array, whereas both arrays A and X are accessed using simple loop indexing functions.

To effectively map this computation to DIVA, we parallelize the execution of the sparse matrix-vector product
by exploiting the commutativity and associativity of the addition operations. In this version, each PIM node
has a local copy of array Y (named PRIV_Y), and performs its updates on its own private copy; after all PIMs
complete their local computation, the local results are merged using a parallel reduction algorithm. The parallel
reduction algorithm ensures that there is no network contention during the communication phase. However,
since each PIM node has to communicate its copy of array Y to other nodes, the total amount of communication
increases with the number of PIMs, as well as the number of steps of the parallel reduction. This example
makes use of a lightweight run-time system and the parcel communication mechanism to generate and manage
concurrency. The basic code generation strategy is for the compiler to split the computation between the host
and the PIM nodes and to initiate the computation on the PIMs by sending parcels, using the SendParcel primitive.
PIM nodes are activated by the receipt of a given parcel and proceed to execute the code associated with
it. This code might in turn generate other concurrent computation on the same or on other PIM nodes. The host
can enforce termination of a given computation using an explicit barrier synchronization construct (Barrier) or
implicitly through memory. Also included in this run-time system is a Flush primitive that allows the compiler
to maintain the consistency of the data between the host caches and the PIM nodes.

(a) Original Loop Nest.

    DO J = 1, N
      DO K = COLSTR[J], COLSTR[J+1]-1
        Y[ROWIDX[K]] = Y[ROWIDX[K]] + A[K] * X[J]

(b) DIVA Host Program.

    Flush(A);
    PartitionSize = Sizeof(ROWIDX) / NUM_PIMNODES;
    for (i=0; i<NUM_PIMNODES; i++) {
      Send_parcel(ROWIDX[i*PartitionSize], LoopBody, PartitionSize,
                  A[i*PartitionSize], PRIV_COLSTR[0,i], PRIV_X[0,i], Y);
    }
    Barrier();

(c) Code for PIM node command LoopBody.

    BarrierEnter();
    for (j=1; j<=N; j++) {
      Lower = Max(PRIV_COLSTR[j], PIMID*PartitionSize);
      Upper = Min(PRIV_COLSTR[j+1]-1, (PIMID+1)*PartitionSize-1);
      for (i=Lower; i<=Upper; i++) {
        K1 = i - PIMID*PartitionSize;
        PRIV_Y[ROWIDX[K1]] = PRIV_Y[ROWIDX[K1]] + (A[K1] * PRIV_X[j]);
      }
    }
    ParallelReduction(Y, PRIV_Y, PIMID, NUM_PIMNODES);
    BarrierRelease();

Figure 5: CG matrix-vector product and its mapping to DIVA.

Figure 5(b) and Figure 5(c) present the corresponding code for the DIVA architecture, which makes use of the
parcel and synchronization primitives to orchestrate the computation. Figure 6 illustrates graphically the data
mapping for the various arrays in this computation for a system with 4 PIM nodes.
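
The privatize-then-reduce mapping can be checked against the sequential loop with a small runnable sketch. It is written in Python for illustration only and uses 0-based indexing: the function names are assumptions, a plain loop over node ids stands in for the PIM nodes, and the pairwise reduction is one common way to merge private copies, not necessarily the reduction DIVA uses.

```python
def spmv_sequential(n, colstr, rowidx, a, x, num_rows):
    """Column-oriented sparse matrix-vector product, as in the original loop nest."""
    y = [0.0] * num_rows
    for j in range(n):
        for k in range(colstr[j], colstr[j + 1]):
            y[rowidx[k]] += a[k] * x[j]
    return y


def spmv_partitioned(n, colstr, rowidx, a, x, num_rows, num_nodes):
    nnz = len(rowidx)
    part = -(-nnz // num_nodes)           # ceiling division: nonzeros per node
    priv_y = []
    for pim in range(num_nodes):           # each iteration stands in for one PIM node
        first = pim * part
        local_rowidx = rowidx[first:first + part]   # this node's partition of ROWIDX
        local_a = a[first:first + part]             # this node's partition of A
        y_local = [0.0] * num_rows                  # private copy of Y
        for j in range(n):
            lower = max(colstr[j], first)
            upper = min(colstr[j + 1], first + part)
            for k in range(lower, upper):
                k1 = k - first              # index into the local partition
                y_local[local_rowidx[k1]] += local_a[k1] * x[j]
        priv_y.append(y_local)
    # Pairwise (tree) reduction of the private copies into node 0.
    stride = 1
    while stride < num_nodes:
        for pim in range(0, num_nodes - stride, 2 * stride):
            dst, src = priv_y[pim], priv_y[pim + stride]
            for r in range(num_rows):
                dst[r] += src[r]
        stride *= 2
    return priv_y[0]


# 3x3 example matrix [[2,0,1],[0,3,0],[4,0,5]] stored by columns.
colstr = [0, 2, 3, 5]
rowidx = [0, 2, 1, 0, 2]
a = [2.0, 4.0, 3.0, 1.0, 5.0]
x = [1.0, 2.0, 3.0]
y_seq = spmv_sequential(3, colstr, rowidx, a, x, 3)
y_par = spmv_partitioned(3, colstr, rowidx, a, x, 3, num_nodes=2)
```

The two results agree because addition is commutative and associative, so the order in which partial sums reach each element of Y does not matter. With floating-point values that are not exactly representable, the results may differ by rounding.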


[Figure 6: columns PIM 0 to PIM 3; A partitioned, COLSTR replicated, ROWIDX partitioned, PRIV_Y privatized,
Y partitioned, X replicated.]

Figure 6: Data Mapping on DIVA for CG.


Figure 7 illustrates simulation results for this application. We separate original sequential execution into several
components in Figure 7(a). The host busy category accounts for the time spent executing instructions. The
L1 and L2 miss stall categories represent time spent waiting for memory accesses to be satisfied from either the
L2 cache or main memory. In the version of the program that executes the matrix-vector product on the PIMs,
we show time spent in the host and on average in one PIM, and we include additional categories (the host is
idle during PIM execution, so this is an accurate reflection of overall execution time). The coherency overhead
refers to time spent by the host flushing cache lines prior to execution on the PIMs. Note that additional coherency
overhead is charged as L1 and L2 cache misses in the host when PIMs are used; by flushing data from the
cache prior to PIM computations, extra cache misses in the host may occur in later host computation. This
cache miss effect due to flushing is not significant in the programs presented here because the irregular
accesses in the PIM computations were polluting the host cache when executed on the host. Additional categories
show time spent in the PIMs, including PIM-to-PIM communication overhead and time spent in local
memory stalls.

(a) Execution Breakdown    (b) Speedup

Figure 7: Execution Breakdown and Speedups for CG


As the results in Figure 7(a) indicate, the original application suffers significantly from poor cache locality
with overall L1 and L2 cache miss rates of 15% and 20%, respectively. When the matrix-vector product is executed
on the PIMs, far fewer accesses to array Y are brought into the host, and the miss rates on the L1 and L2
caches were reduced to 10% and 7%, respectively. As Figure 7(a) shows, this contributes to a significant reduction
of the application time waiting for results from memory. Figure 7(b) shows the overall application speedups
for different numbers of PIM nodes as compared to the entire application executing on the host. At 16
PIMs, the application speedup is more than S over the original sequential execution time. While this application
scales very well for up to 16 PIM nodes, the problem size we use is too small relative to the overhead of
the reduction computation to scale much beyond 32 PIMs.

7.2 Hash-Based Natural Join

The Natural Join is a fundamental operation in relational database systems. It consists of generating all possible
combinations of tuples for two relations R and S with a common attribute A. In the implementation used in
these experiments, the algorithm builds a hash table for each of the relations R and S indexed by the attribute A.
Then, for each hashed value in the table, the algorithm joins all tuples of the two relations that have a common
value for the attribute A.


(a) Execution breakdown    (b) Overall Speedup    (c) Speedup for Join phase

Figure 8: Execution Breakdown and Speedups for Natural Join.


The strategy to map this application to DIVA is to distribute the hash table along contiguous blocks of the table
entries. Each PIM node has a set of consecutive entries of the hash table and the hash-table collision lists corresponding
to each of the table entries it owns. Once the host processor has constructed the distributed hash table, the
natural join operation proceeds by having each PIM node compute a local natural join operation. At the
end, the host simply scans the partial hash tables local to each PIM node to read the results.
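
The two-phase structure of the join (host builds the hash tables, each PIM joins the contiguous block of entries it owns) can be sketched as follows. This is an illustrative Python sketch, not the code used in the experiments: the function names, the bucket count and the tuple layout are assumptions, and a plain loop over node ids stands in for the PIM nodes.

```python
def build_table(relation, key_index, num_buckets):
    """Host phase: hash every tuple of a relation on the join attribute."""
    table = [[] for _ in range(num_buckets)]
    for row in relation:
        table[hash(row[key_index]) % num_buckets].append(row)
    return table


def local_join(r_buckets, s_buckets, r_key, s_key):
    """PIM phase: join matching buckets that live on the same node."""
    out = []
    for r_rows, s_rows in zip(r_buckets, s_buckets):
        for r in r_rows:
            for s in s_rows:
                if r[r_key] == s[s_key]:
                    # Keep the shared attribute once, as a natural join does.
                    out.append(r + s[:s_key] + s[s_key + 1:])
    return out


def natural_join(r, s, r_key, s_key, num_buckets=8, num_nodes=4):
    r_table = build_table(r, r_key, num_buckets)
    s_table = build_table(s, s_key, num_buckets)
    per_node = -(-num_buckets // num_nodes)   # contiguous block of buckets per node
    results = []
    for node in range(num_nodes):              # each iteration stands in for one PIM
        lo, hi = node * per_node, (node + 1) * per_node
        results.extend(local_join(r_table[lo:hi], s_table[lo:hi], r_key, s_key))
    return results


R = [(1, "a"), (2, "b"), (2, "c")]
S = [(2, "x"), (3, "y"), (1, "z")]
joined = sorted(natural_join(R, S, r_key=0, s_key=0))
```

Tuples with equal attribute values always hash to the same bucket, so every matching pair ends up on the same node and no node needs data from another during the join phase.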

Figure 8 shows performance results when the first phase, constructing the local hash tables, is performed by the
host, and the second phase, the join of local hash tables, executes in the PIMs. The speedups for the join phase
of the computation are superlinear, as shown in Figure 8(c). These superlinear speedups result from the combined
effects of the smaller memory latencies at the PIMs, as compared to the cache miss latencies suffered by
the host, and the parallelism obtained by distributing the computation across PIMs. Due to Amdahl's Law, the
overall speedup is limited, as the first phase, which accounts for about half the baseline execution time, is executed
sequentially on the host. Even more speedup is possible from two sources, both of which we are exploring:
(1) building portions of the local hash table in parallel on the PIMs and merging the results; and (2)
performing in parallel on the ASAP unit the comparison of a key from the R-tuple with that of several S-tuples
with the same hash value.
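
The Amdahl's Law limit mentioned above is easy to quantify. The sketch below assumes, as the text states, that the sequential first phase takes about half the baseline time; the function name overall_speedup and the sample join-phase speedup of 16 are illustrative choices, not measured values.

```python
def overall_speedup(parallel_fraction, phase_speedup):
    """Amdahl's Law: speedup of the whole run when only one phase is accelerated."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / phase_speedup)


# Join phase is about half the baseline time; suppose it speeds up 16x.
example = overall_speedup(0.5, 16)      # 32/17, a little under 1.9
# Even an arbitrarily fast join phase cannot push the overall speedup past 2.
ceiling = overall_speedup(0.5, 1e12)
```

This is why the overall speedup stays below 2 however superlinear the join-phase speedup becomes, and why parallelizing the hash-table construction is the natural next step.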

7.3 Object-Oriented Database Benchmark (OO7)

The OO7 application implements a representative object-oriented database for CAD applications. The database
schema defines several one-to-one and one-to-many relationships among database objects. These objects consist
of documents, manuals and base or complex assembly components. Each complex assembly component is
defined hierarchically in terms of other base or complex assemblies. Base assembly components
are defined in terms of composite parts, which in turn consist of more than one library atomic part. Each
of these objects has specific attributes such as a unique identifier, creation date and other type-specific fields.

This database application was originally developed at the University of Wisconsin to study the performance of
various database management systems [OO7]. We have ported this application to a C++ stand-alone program by
implementing the dictionary and relations abstraction using hash-tables and linked lists in a total of 0,000 lines
of C++ code. Our performance evaluation concentrates on a specific database query, query #6. Query #6 finds
all assemblies (base or complex) B that reference (directly or transitively) a composite part with a more recent
build date than B's build date. This query is implemented using set operations over the database relations and
extensively uses the iteration abstraction from C++ to access successive objects in a given relation.

Besides the overall organization of the database objects in a graph data structure, the database schema also
relies heavily on singly-linked and hash-table pointer-based data structures for indexing of the objects in each
category (documents, manuals, base assemblies, etc.). The primary access pattern over the indexing structure
traverses a singly-linked list or a hash-table, searching for a particular subset of objects matching a given predicate.
In addition, the application also traverses the overall graph structure of the objects in the database. Such
traversals perform poorly on conventional systems because they exhibit almost no temporal reuse of memory
accesses, and there is little spatial locality due to the way the pointer-based data structures are created.

To take advantage of the PIM architecture, we perform two key transformations on the original application.
The first transformation takes advantage of the fact that the computation accesses a set of objects; the order in
which the elements of the set are accessed by the application is irrelevant, so these accesses can be performed
in parallel. The second transformation restructures the code so that the PIM nodes traverse the linked data
structure that represents the relations in the schema and select the set of objects the computation needs to
access. Each PIM selects a subset of the objects in the relation from its local memory only. The host then gathers
the partial results and constructs a larger set. The host is responsible for any updates to the storage. Figure 9
shows the execution time breakdown and speedups for OO7.


(a) Execution breakdown

Figure 9: Execution Breakdown and Speedups for OO7.


The results show an impressive superlinear speedup. As the execution breakdown reveals, this result is due to
the severe performance impact of the L2 miss stalls (almost 80% of the sequential computation for this query)
for the case where only the host executes the computation. When the computation is partitioned across the PIM
nodes, each PIM fetches data from its local memory and communicates very infrequently. The overhead of
coherence is also negligible for all runs, as the query does not update the objects in the database but rather collects
overall statistics. As a result, the performance scales well up to 16 PIM nodes. For 32 PIM nodes, speedup, while still
impressive, trails off a little due to the relative frequency of communication compared to computation for this
data set size.
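
The select-then-gather restructuring used for this query can be sketched in a few lines. This is a hedged illustration in Python, not the ported C++ application: the names pim_select and host_gather, the (identifier, build date) tuple layout and the sample predicate are all assumptions.

```python
def pim_select(local_objects, predicate):
    """Runs on one PIM: scan only the objects held in local memory."""
    return [obj for obj in local_objects if predicate(obj)]


def host_gather(partitions, predicate):
    """Runs on the host: union the partial results returned by every PIM."""
    selected = set()
    for local_objects in partitions:       # each partition stands in for one PIM node
        selected.update(pim_select(local_objects, predicate))
    return selected


# Objects are (identifier, build_date) pairs spread over three nodes.
partitions = [
    [(1, 1990), (2, 1997)],
    [(3, 1995)],
    [(4, 1998), (5, 1989)],
]
recent = host_gather(partitions, lambda obj: obj[1] > 1994)
```

Because the result is a set, the order in which the partial results arrive is irrelevant, which is what allows the local selections to run independently.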

8.0 Conclusions and Future Work

This paper has described the DIVA system, an architecture incorporating PIM devices as smart memories to
one or more external host processors. Other distinguishing features of DIVA include its PIM-to-PIM interconnect
and explicit support for in-memory operations on irregular data structures. In this paper, we presented system-level
requirements for in-memory acceleration of irregular applications. We presented three case studies,
sparse conjugate gradient, natural join and an OO7 database query, to demonstrate how irregular applications
can be mapped to the DIVA architecture. High-level simulation results show a speedup for all three applications,
resulting from increased processor-memory bandwidth, much more effective use of cache on the host
processor, lower latency accesses and parallelism.

Future descriptions of the DIVA project will include details of the PIM VLSI device, architecture studies using
a high-fidelity system simulator based on RSIM, the DIVA compiler and run-time systems, and further application
studies.

This paper presents a fast, simple router design for
implementing the Red Rover algorithm for a bidirectional ring.
This design is very suitable for the Data-Intensive Architecture
(DIVA) system, a system which demonstrates the benefits of embedded
DRAM technology, because of its high performance as
well as simple architecture and low cost. The key attributes of this
router are one clock node-to-node latency, high channel throughput,
and simple hardware implementation. The router architecture
employs short-cut FIFO data paths, which makes the router
speed independent of the channel buffer size (in terms of flits).
A prototype implementation of the router achieves a maximum
channel bandwidth of 5.12 Gb/s and runs at 80 MHz using
CMOS signaling in [...] technology. This high throughput and
low latency were achieved without resorting to the use of complex
high-speed signaling technologies.

I. Introduction

Embedded DRAM technology is growing in popularity, as it
appears to be a promising solution to the increasing gap between
processor and memory speeds [6]. Integrating processor
logic and memory in processing-in-memory (PIM) chips
offers dramatically increased memory bandwidths (up to 2 orders
of magnitude) over conventional systems. Furthermore,
memory latency is also reduced because internal memory accesses
avoid the delays associated with communicating off
chip. The Data-Intensive Architecture (DIVA) system aims to
exploit this technology by combining PIM devices with one
or more external host processors and a PIM-to-PIM interconnect
[9]. The DIVA system design imposes a unique set of
requirements on the PIM-to-PIM interconnect. PIM chips will
be physically grouped as conventional memory chips, mounted
on DIMM modules. The number of PIM chips in a hosted
cluster is therefore in the range of 32 to 64 chips, depending
on how many PIM chips can be packed onto a DIMM module.
The PIM-to-PIM interconnect must then be amenable to
the dense packing requirement of DIMM modules. Low latency
and high throughput are also desirable properties of this
interconnect. Furthermore, this network must be scalable to allow
the addition or removal of modules. This combination of
requirements favors a one-dimensional network. Recently implemented
routers such as SGI SPIDER [5] and the Cray T3E
network router [8] are not suitable to be embedded in PIM devices
because of complexity and size. The resulting PIM Routing
Component (PiRC) is a one-dimensional wormhole router
which implements the Red Rover routing algorithm to effect
deadlock-free routing in bidirectional rings [1], [3]. The Red
Rover algorithm provides a more even, symmetric distribution
of message traffic among virtual channels in a bidirectional
ring and therefore attains lower latencies and higher throughput
than Dally's spiral algorithm [4]. Additionally, the PiRC
routes fixed-size packets and uses source routing to achieve
low latency. The PiRC architecture is presented in detail in
Section II. Section III describes implementation and performance
issues. Simulation scenarios for testing functionality
are presented in Section IV, and concluding remarks are given
in Section V.

II. Router Architecture

Because it employs the Red Rover algorithm, the PIM Routing Component (PiRC) has a very simple architecture and may be viewed as two identical routers which are time-multiplexed. One router operates on the rising transition of the clock while the other operates on the falling transition. In this manner, two virtual channels (A and B) are time-multiplexed onto each physical channel. Each virtual router contains controlling logic, consisting of an input controller, switch, and output controller, and short-cut FIFO data paths (see Figure 1). A channel input controller receives control signals from a sender and generates control signals for storing data into a short-cut FIFO. The switch and output controller determines to which output port input data should be forwarded and arbitrates fairly among contending requests for a particular output port. The handshaking protocol between sending and receiving PiRC channels, described in Section A, is also very simple and efficient.

Other factors also contribute to the simplicity of the PiRC architecture. A packet is constrained to a fixed length of ten 32-bit flits, and the phit size is the same as the flit size. All operations, including receiving, switching, arbitrating, and sending, are done in a half clock cycle. Thus, only one clock is needed for a flit to traverse from one node to the next in the non-blocking case. The PiRC implements wormhole routing [7] so that flits of a blocked packet remain in place in the network channels. However, each PiRC FIFO contains enough space to buffer a complete 320-bit packet. This ability simplifies the handshaking so that handshakes need only occur on packet boundaries rather than on every flit.
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Fig.  1.  PIM  Routing  Component  (PiRC)  Block  Diagram 

Figure 2 shows the internal interface for one virtual level of the PiRC (the other level is identical). This figure shows how the FIFOs, input controllers (INC), switches (SW), and output controllers (OUTC) interact for the positive (+), negative (-), and processing element (Pe) directions. Note especially the switching and merging combinations in the data paths. A packet entering the (+) FIFO may continue in the (+) direction or exit the network through the Pe port. Similarly, a packet entering the (-) FIFO may continue in the (-) direction or exit the network through the Pe port. Finally, a packet which is injected via the Pe FIFO may enter the network via the (+) or (-) port.

These routing restrictions result in 2-way switchers and 2-way mergers at every point of contention. This artifact simplifies the router design, requiring the design of only one merge and one switch element that are then replicated as needed. The SI and RI signals are the send and ready handshaking signals for the input channels, while the SO and RO signals correspond to the output channels. More detail about their operation is given in the following section.
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Fig.  2.  PiRC  Internal  Interface 


A. Handshaking Protocol

The handshaking protocol is simple and efficient. First, note that the SI and RI signals of a receiving PiRC channel are connected to the SO and RO signals, respectively, of a neighboring sending PiRC channel. The receiver keeps asserting the RI signal as long as its corresponding FIFO is not full. The sender keeps sampling the corresponding RO signal at every edge of the clock and starts sending a pending message whenever the receiver is ready. By using this protocol, the sender constantly monitors the state of the receiver and does not waste time explicitly requesting the status of the receiver FIFO. As depicted in Figure 3, this protocol makes it possible for the sender to decide to send data as soon as an asserted RO is sampled. To indicate it is sending data, the sender asserts SO, and the receiver latches DIN data into the FIFO upon sampling the corresponding asserted SI signal. The receiver then latches data on the next nine clock cycles to receive the entire packet.
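
The protocol above can be sketched as a small cycle-based model in Python. This is an illustrative sketch, not the PiRC logic itself: the class names `Sender` and `Receiver` and the function `simulate` are hypothetical, one loop iteration stands for one transfer slot, and "not full" is treated conservatively as "has room for one whole packet and no packet is in flight", which is an assumption made to keep the model short.

```python
PACKET_FLITS = 10  # fixed packet length from the text


class Receiver:
    def __init__(self, capacity=PACKET_FLITS):
        self.capacity = capacity
        self.fifo = []
        self.expecting = 0  # body flits still expected for the current packet

    def ri(self):
        # Assumption: ready only when a whole packet fits and none is in flight.
        free = self.capacity - len(self.fifo)
        return self.expecting == 0 and free >= PACKET_FLITS

    def clock(self, si, din):
        if si:                        # SI marks the header flit of a new packet
            self.fifo.append(din)
            self.expecting = PACKET_FLITS - 1
        elif self.expecting > 0:      # body flits need no further handshake
            self.fifo.append(din)
            self.expecting -= 1


class Sender:
    def __init__(self, packets):
        self.pending = [list(p) for p in packets]
        self.current = []

    def clock(self, ro):
        """Return (so, dout) for this slot."""
        if self.current:              # mid-packet: keep streaming flits
            return False, self.current.pop(0)
        if self.pending and ro:       # RO sampled asserted: start immediately
            self.current = self.pending.pop(0)
            return True, self.current.pop(0)
        return False, None


def simulate(packets, slots):
    tx, rx = Sender(packets), Receiver()
    trace = []
    for _ in range(slots):
        ro = rx.ri()                  # sender RO is wired to receiver RI
        so, dout = tx.clock(ro)
        rx.clock(so, dout)            # receiver SI is wired to sender SO
        trace.append((ro, so))
    return rx.fifo, trace
```

In this model SO is asserted once per packet, on the header flit, and the remaining nine flits follow without any further handshake, which mirrors the packet-boundary handshaking described above.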

Fig. 3. Handshaking between a sender and a receiver (timing diagram of the CLK, RO, SO, and DOUT signals)

B.  Short-Cut  FIFO 

In order to achieve high-speed data transmission along the physical channel, fast switching activity between channels is essential. A previous implementation of a Red Rover router, the PDSS router [2], specified a simple controller and complex flit buffer design and is suitable for only a small number of flits per packet. In the PDSS router, there are a large number of flit buffers that can drive the final output stage bus, as shown in Figure 4. This arrangement results in a large capacitive load. The controller is, however, very simple, such that finite state machines without peripheral logic are sufficient for controlling the register-tristate buffer pairs. In contrast, the PiRC implements a complex controller and simple FIFOs in order to accommodate a large number of flit buffers in the channel buffer. In fact, the output stage load capacitance is independent of the number of flit buffers in a short-cut FIFO because only the top element of the FIFO is capable of driving the output stage bus. This characteristic makes the design very flexible with regard to channel size and is important because different package types impose different pin-count, and therefore channel size, constraints. With this design, every flit in the FIFO shifts toward the top of the FIFO as long as the path is not blocked. Also, incoming flits are placed in the first empty flit buffer (from the top of the FIFO). Figure 5 illustrates the cell of the FIFO, the block diagram, and an example of flit movement.

Fig. 4. Channel Buffer Design: (a) Register-Tristate Buffer in PDSS Router and (b) Short-cut FIFO in PiRC

Fig. 5. Short-Cut FIFO: (a) FIFO Cell, (b) Block Diagram, and (c) Movements of Flits


The cell in (a) has two data inputs, one from the neighboring flit in the FIFO (NGH) and the other from the external input for this router channel (EXT). The decoder in (b) generates proper control signals for the FIFO based on current conditions and a write pointer indicator. (c) is an example showing the movement of flits. Until T3 there is no blocking; therefore, the flit in flit buffer 0 goes out and the other flits shift toward the top of the FIFO so that the FIFO depth is constant. New flits from the external input are loaded into flit buffer 2 (this example assumes some residual flits exist in the FIFO initially). Flits do not shift if the output path is blocked, as shown during T4, T5, and T6. However, the write pointer increments so that subsequent incoming flits begin to fill up the FIFO. When the path becomes unblocked, flits drain out as shown from T7.
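
The flit movement described above can be approximated with a behavioral model. The sketch below is an assumption-laden illustration rather than the circuit: the class `ShortCutFifo` and its `step` method are hypothetical names, a Python list stands in for the chain of flit buffers, and `pop(0)` stands in for all flits shifting one place toward the top.

```python
class ShortCutFifo:
    """Behavioral sketch of a short-cut FIFO; cells[0] is the top."""

    def __init__(self, depth=10):
        self.depth = depth
        self.cells = []

    @property
    def write_pointer(self):
        return len(self.cells)       # index of the first empty flit buffer

    @property
    def full(self):
        return len(self.cells) == self.depth

    def step(self, incoming=None, blocked=False):
        """Advance one transfer slot and return the flit sent out, or None."""
        out = None
        if not blocked and self.cells:
            out = self.cells.pop(0)  # only the top cell drives the output bus
        if incoming is not None:
            if self.full:
                raise OverflowError("write attempted while the FIFO is full")
            self.cells.append(incoming)  # lands in the first empty buffer
        return out
```

With a simultaneous read and write the write pointer stays where it is, when the output is blocked it increments, and when the FIFO only drains it decrements, which matches the behavior the text attributes to Figure 5(c).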

In order to keep track of the header flit of a packet, the SI signal flows through a one-bit FIFO as the header flit moves. The operation of this one-bit FIFO is identical to that of the data FIFO described above. This signal becomes the output signal SO in the final output stage, indicating that the router channel is sending a header flit of a new packet.

C.  Input  Controller 

The input controller, shown in Figure 6, is simple counter-based logic that directs the loading of flits into the short-cut FIFO. When the input controller samples an asserted SI signal, it begins latching the flits of an incoming packet. The up/down counter dynamically changes the write pointer value, which always points to the first empty space in the FIFO, as the router reads and writes flits. The wen (write enable) generator causes the up/down counter to increment the write pointer value when SI arrives from the sender. The ren (read enable) generator is activated when the output controller starts forwarding flits from the FIFO to an output channel, and it also prevents the reading of garbage in an empty FIFO. The counter operates at both clock edges so that it can increase the write pointer at the rising edge when a new flit is written and decrease the pointer at the falling edge when a flit is read from the FIFO (these clock edges apply for A virtual channels; for B virtual channels, the opposite clock edges apply). The full-empty detector indicates the status of the FIFO. The RI handshaking signal is merely the inverse of the full signal. As mentioned earlier, the decoder translates the write pointer value into proper control signals for the FIFO.
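
As a rough illustration of the counter-based logic, the following Python sketch models the up/down counter together with the full/empty detector. The class `UpDownCounter` and its method names are hypothetical, and the edge assignment follows the A-channel convention stated in the text; it is a behavioral sketch, not the synthesized controller.

```python
class UpDownCounter:
    """Write-pointer counter with full/empty detection (behavioral sketch)."""

    def __init__(self, depth=10):
        self.depth = depth
        self.value = 0            # write pointer: first empty flit buffer

    @property
    def full(self):
        return self.value == self.depth

    @property
    def empty(self):
        return self.value == 0

    @property
    def ri(self):
        return not self.full      # RI is the inverse of the full signal

    def rising_edge(self, wen):
        # A-channel convention: a written flit is counted on the rising edge.
        if wen and not self.full:
            self.value += 1

    def falling_edge(self, ren):
        # A read flit is counted on the falling edge; reads of an empty
        # FIFO are suppressed, as the ren generator does in the text.
        if ren and not self.empty:
            self.value -= 1
```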


Fig. 6 (input controller) legend:

WG   write-enable generator
RG   read-enable generator
UDC  up/down counter
FED  full/empty detector
DCD  decoder for translating counter output to FIFO control signals


D.  Switch  and  Output  Controller 

The output controller samples RO at every clock edge so that it can send a pending packet as soon as possible. Once RO is asserted and detected by the output controller, the header flit of a pending packet and SO are sent immediately. While flits are being transmitted to the receiver, the write pointer of the sending FIFO decrements if there are no incoming flits from the neighboring PiRC. On the other hand, the pointer keeps pointing to the same flit buffer in the sending FIFO if the sending FIFO is simultaneously receiving data from its neighbor. The switch determines the direction in which a packet is to be forwarded. The first flit of a packet, the header, contains routing information for the switch. The header is unary encoded such that the number of hops a packet is to traverse is indicated by the number of 1's set in the header. The header is shifted at each hop so that this value is decremented. Therefore, the switch simply inspects the first bit of the routing header to determine which output port to request for a given packet. Using a first-come-first-served policy, the output controller arbitrates fairly between requests from two FIFOs contending for usage of the same output physical channel. If contending requests arrive in the same clock cycle at an idle output controller, an arbitrary selection is performed; however, the FIFO which is not granted access during this arbitration is guaranteed access when the current FIFO completes, based on the first-come-first-served policy.
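
The unary source-routing header lends itself to a short sketch. The code below is illustrative only: the function names are hypothetical, and the choices that the "first bit" is the least-significant bit, that the shift is a right shift, and that a 0 bit means "deliver to the local Pe port" are assumptions, since the text does not fix the bit order.

```python
HEADER_BITS = 32  # the header occupies one 32-bit flit


def make_header(hops):
    """Unary-encode a hop count as `hops` consecutive 1 bits."""
    if not 0 <= hops <= HEADER_BITS:
        raise ValueError("hop count does not fit in one header flit")
    return (1 << hops) - 1


def route_step(header):
    """Return (port, next_header) for one router.

    Assumption: bit 0 is inspected; 1 keeps the packet moving in its ring
    direction, 0 hands it to the local processing element (Pe) port.
    """
    if header & 1:
        return "forward", header >> 1   # the shift removes one 1: hops - 1
    return "pe", header


def hops_taken(header):
    """Count how many routers forward the packet before it exits."""
    count = 0
    port, header = route_step(header)
    while port == "forward":
        count += 1
        port, header = route_step(header)
    return count
```

Because each router only tests one bit and shifts, no adder or comparator is needed in the switch, which is consistent with the half-cycle switching budget described earlier.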

III. Implementation and Performance

The PiRC design was begun by behavioral modeling in VHDL and compiled with Synopsys. Cascade EPOCH was used for routing and placement as well as layout generation for a prototype implementation. Control blocks were synthesized, while the short-cut FIFO was generated using custom layout to achieve high density. We tested our design at the behavioral level, pre-synthesis level, and post-synthesis level with Synopsys, and at the transistor level with Powermill.

The resulting PiRC prototype layout is for the HP14B process available through MOSIS. This process uses 0.5 µm, 3-layer metal CMOS technology. The PiRC has a die size of 2.76 mm × 2.36 mm and contains 75,276 transistors. Simple hardware based on an efficient routing algorithm allows us to achieve a clock frequency of 80 MHz. The router operates on both clock edges, leading to a channel bandwidth of 5.12 Gb/s. Only one clock is required for a flit to move from one node to the next, resulting in a node-to-node delay of 12.5 ns. Figure 7 shows the layout of the PiRC, placed and routed with the floor plan of Figure 1. Although this prototype achieves respectable performance, we expect performance to improve significantly when we migrate to a currently available embedded DRAM process using 0.25 µm or even 0.18 µm technology, such as the IBM SA-27E or TSMC process.
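
The headline figures follow from simple arithmetic on the parameters given in the text (32-bit flits, ten flits per packet, an 80 MHz clock, and one transfer per clock edge). The short Python sketch below, with hypothetical function names, reproduces that arithmetic so the numbers can be rechecked for other clock rates or flit widths.

```python
FLIT_BITS = 32          # flit (and phit) width, from Section II
FLITS_PER_PACKET = 10   # fixed packet length, from Section II
CLOCK_HZ = 80e6         # prototype clock frequency
EDGES_PER_CYCLE = 2     # one flit is transferred on each clock edge


def channel_bandwidth_bps(flit_bits=FLIT_BITS, clock_hz=CLOCK_HZ,
                          edges=EDGES_PER_CYCLE):
    """Peak physical-channel bandwidth in bits per second."""
    return flit_bits * edges * clock_hz


def node_to_node_delay_s(clock_hz=CLOCK_HZ):
    """One full clock period per hop in the non-blocking case."""
    return 1.0 / clock_hz


def packet_bits(flit_bits=FLIT_BITS, flits=FLITS_PER_PACKET):
    """Size of one fixed-length packet in bits."""
    return flit_bits * flits


if __name__ == "__main__":
    print(channel_bandwidth_bps() / 1e9, "Gb/s")
    print(node_to_node_delay_s() * 1e9, "ns")
    print(packet_bits(), "bits per packet")
```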

IV. Simulation

Five critical scenarios were used to verify the PiRC design. The external PiRC connections used for simulation are shown in Figure 8. This configuration allows short-cut FIFOs to be cascaded together so that one FIFO essentially feeds another. The header flit of a packet is set in simulation to specify whether the corresponding packet travels from the (Pe) FIFO to the (+) FIFO or the (-) FIFO. Test vectors are injected on the Tester terminals indicated in Figure 8, which essentially serve as processing element signals.

Fig. 7. Layout

Fig. 8. Router Configuration for Testing

The scenarios are as follows:


1. Two messages move back-to-back without blocking.

2. Two messages move back-to-back. The first message is blocked until the (+,-) FIFO is full. Consequently, the second message is blocked in the (Pe) FIFO and starts filling it. Then, the first message becomes unblocked and drains out. As soon as the first message starts moving out, the second message follows it along the path.

3. Two messages move back-to-back. The first message is blocked until the (+,-) FIFO gets half-way full, and then the first message drains out.

4. The first message is blocked until the (+,-) FIFO fills half-way, and when the first message starts draining out of the (+,-) FIFO, the second message is injected into the (Pe) FIFO from the tester. Due to the short-cut FIFO design, the second message quickly traverses the (Pe) FIFO to trail the first message.

5. Two packets in the (+) FIFO and the (Pe) FIFO request the same channel concurrently. This scenario ensures that fair arbitration is performed when resolving conflicts.

The PiRC performed successfully for all possible combinations of the above scenarios for two sets of virtual channels.

V.  Conclusion 

A fast, simple router for the Data-Intensive Architecture (DIVA) system has been presented. This device, the PIM Routing Component (PiRC), implements the Red Rover routing algorithm to achieve high performance with minimal complexity. The PiRC has the advantages of simple logic, one-clock node-to-node delay, high channel throughput, and robust speed consistency, regardless of the number of flit buffers in a channel buffer. This combination of attributes makes the PiRC ideal for the DIVA system.
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Abstract 

In this paper, we describe an algorithm and implementation of locality optimizations for architectures with instruction sets such as Intel's SSE and Motorola's AltiVec that support operations on superwords, i.e., aggregate objects consisting of several machine words. We treat the large superword register file as a compiler-controlled cache, thus avoiding unnecessary memory accesses by exploiting reuse in superword registers. This research is distinguished from previous work on exploiting reuse in scalar registers because it considers not only temporal but also spatial reuse. As compared to optimizations to exploit reuse in cache, the compiler must also manage replacement, and thus, explicitly name registers in the generated code. We describe an implementation of our approach integrated with a compiler that exploits superword-level parallelism (SLP). We present a set of results derived automatically on 4 multimedia kernels and 2 scientific benchmarks. Our results show speedups ranging from 1.3 to 2.8X on the 6 programs as compared to using SLP alone, and we eliminate the majority of memory accesses.

1  Introduction 

In response to the increasing importance of multimedia applications in embedded and general-purpose computing environments, many microprocessors now incorporate an expanded instruction set and architectural extensions specifically targeting multimedia requirements. The core component of such architectural extensions is a functional unit that can operate on aggregate objects, performing bit-level operations, or SIMD parallel operations on variable-sized fields in the object (e.g., 8, 16, 32 or 64-bit fields). If the aggregate objects are larger than the size of a machine word, then they are called superwords [20]. Examples include Motorola's AltiVec and Intel's SSE, a descendant of MMX. If the same size as the machine word, then individual fields are referred to as subwords [22]. A related class of architectures employs processing-in-memory (PIM) technology to exploit the high memory bandwidth available when processing logic is combined on chip with large amounts of DRAM; several PIM-based architectures rely on superword parallelism to make more effective use of available memory bandwidth [2, 17, 3, 11].

While multimedia extension and related architectures have been available for some time, convenient methodologies for developing application code that targets these extensions are in their infancy. There is recent compiler research for such architectures to automatically exploit superword-level parallelism, performing computations or memory accesses in parallel in a single instruction issue [20, 27, 8, 10, 1].

In this paper, we recognize an additional optimization opportunity not addressed by this previous work. An important feature of all such architectures is a register file of superwords (e.g., each 128 bits wide in an AltiVec), usually in addition to the scalar register file. A set of 32 such superword registers represents a not insignificant amount of storage close to the processor. Accessing data from superword registers, versus a cache or main memory, has two advantages. The most obvious advantage is lower latency of accesses; even a hit in the L1 cache has at least a 1-cycle latency. Accesses to other caches in the hierarchy or to main memory carry much higher latencies. Another advantage is the elimination of memory access instructions, thus reducing the number of instructions to be issued.

In this paper, we treat the superword register file as a small compiler-controlled cache. We develop an algorithm and a set of optimizations to exploit reuse of data in superword registers to eliminate unnecessary memory accesses, which we call superword-level locality. We evaluate the effectiveness of these superword-level locality (SLL) optimizations through an implementation integrated with the algorithm for exploiting superword-level parallelism (SLP) presented in [20].

Our approach is distinguished from previous work on increasing reuse in cache [9, 12, 14, 15, 16, 19, 28, 30], in that the compiler must also manage replacement, and thus, explicitly name the registers in the code. As compared to previous work on exploiting reuse in scalar registers [30, 5, 23], the compiler considers not just temporal reuse, but also spatial reuse, for both individual statements and groups of references. Further, it also considers superword parallelism in making its optimization decisions. Exploiting spatial and group reuse in superword registers requires more complex analysis as compared to exploiting temporal reuse in scalar registers, to determine which accesses map into the same superword.

          Original      SLP only        Scalar register reuse only   SLP and SLL
          Figure 1(a)   Figure 1(b)     Figure 1(d)                  Figure 1(f)
Reads     3n²           2n² + n²/sws    n²/2 + n                     (n²/2 + n)/sws
Writes    n²            n²/sws          n²                           n²/sws

Table 1. Number of array accesses under different optimization paths.

The contributions of this paper are as follows:

•  An algorithm for exposing opportunities for compiler-controlled caching of data in superword register files.

•  A description of a set of optimizations, which in aggregate we call superword replacement, for exploiting superword register reuse.

•  Experimental results, derived automatically, comparing performance of six benchmarks/multimedia kernels optimized for parallelism only (SLP) and optimized for both parallelism and superword-level locality. Our results show speedups ranging from 1.3 to 2.8X as compared to using SLP alone, and we eliminate the majority of memory accesses.

The remainder of the paper is organized into six sections. Section 2 motivates the problem and introduces terminology used in the remainder of the paper. Section 3 presents the main superword-level locality algorithm, which performs a set of transformations and an optimization search that exposes opportunities for reuse of data in superword registers. Section 4 presents optimizations to actually achieve this reuse of data in superword registers. Section 5 presents experimental results derived automatically by an implementation in the Stanford SUIF compiler. Section 6 discusses related work and Section 7 presents conclusions and future work.

2 Background and Motivation

In many cases superword-level parallelism and superword-level locality are complementary optimization goals, since achieving SLP requires each operand to be a set of words packed into a superword, which happens, with no extra cost, when an array reference with spatial reuse is loaded from memory into a superword register. Therefore, in many cases the loop that carries the most superword-level parallelism also carries the most spatial reuse, and benefits from SLL optimizations. In this paper, we achieve SLL and SLP somewhat independently, by integrating a set of SLL optimizations into an existing SLP compiler [20]. The remainder of this section motivates the SLL optimizations.

Achieving locality in superword registers differs from locality optimization for scalar registers. To exploit temporal reuse of data in scalar registers, compilers use scalar replacement to replace array references by accesses to temporary scalar variables, so that a separate backend register allocator will exploit reuse in registers [5]. In addition, unroll-and-jam is used to shorten the distances between reuse of the same array location by unrolling outer loops that carry reuse and fusing the resulting inner loops together [5]. In conventional architectures with scalar register files, spatial locality can only be obtained in caches.

In contrast, a compiler can optimize for superword-level locality in superword registers through a combination of unroll-and-jam and superword replacement. These techniques not only exploit temporal reuse of data, but also spatial reuse of nearby elements in the same superword. In fact, even partial reuse of superwords can be exploited by merging the contents of two registers containing superwords that are consecutive in memory (see Section 4.3). Thus, as is common in multimedia applications [25], streaming computations with little or no temporal reuse can still benefit from spatial locality at the superword-register level, as well as at the cache level.

While cache optimizations are beyond the scope of this paper, we observe that the SLL optimizations presented here can be applied to code that has been optimized for caches using well-known optimizations such as unimodular transformations, loop tiling and data prefetching. When combining loop tiling for caches, superword-level parallelism and superword-level locality optimizations, the tile sizes should be large enough for superword-level parallelism, and for unroll-and-jam and superword replacement to be profitable.

These points are illustrated by way of a code example, with the original code shown in Figure 1(a). This example shows three optimization paths. Figure 1(b) optimizes the code to achieve superword-level parallelism. Here, sws, an abbreviation for superword size, is the number of data elements that fit within a superword. For example, if a and b are 32-bit float variables, on a machine with 128-bit superwords, sws = 4. In Figures 1(c) and (d), we show how the original program can instead be optimized to exploit reuse in scalar registers, using unroll-and-jam and scalar replacement, respectively. In Figures 1(e) and (f), we combine these ideas, using unroll-and-jam and superword replacement, respectively, to transform the code in (b) for both superword-level parallelism and superword-level locality.

    for (i=0; i<n; i++)
      for (j=0; j<n; j++)
        a[i][j] = a[i-1][j] * b[i] + b[i+1];

(a) Original loop nest.

    for (i=0; i<n; i++)
      for (j=0; j<n; j+=sws)
        a[i][j:j+sws-1] = a[i-1][j:j+sws-1] * b[i] + b[i+1];

(b) After superword-level parallelism (j loop).

    for (i=0; i<n; i+=2)
      for (j=0; j<n; j++) {
        a[i][j] = a[i-1][j] * b[i] + b[i+1];
        a[i+1][j] = a[i][j] * b[i+1] + b[i+2];
      }

(c) Unroll-and-jam on the example in (a) (i loop).

    tmp1 = b[0];
    for (i=0; i<n; i+=2) {
      tmp2 = b[i+1];
      tmp3 = b[i+2];
      for (j=0; j<n; j++) {
        tmp4 = a[i-1][j] * tmp1 + tmp2;
        a[i+1][j] = tmp4 * tmp2 + tmp3;
        a[i][j] = tmp4;
      }
      tmp1 = tmp3;
    }

(d) After scalar replacement on the code in (c).

    for (i=0; i<n; i+=2)
      for (j=0; j<n; j+=sws) {
        a[i][j:j+sws-1] = a[i-1][j:j+sws-1] * b[i] + b[i+1];
        a[i+1][j:j+sws-1] = a[i][j:j+sws-1] * b[i+1] + b[i+2];
      }

(e) Unroll-and-jam on the example in (b) (i loop).

    tmp1[0:sws-1] = b[0:sws-1];
    stmp1 = tmp1[0];
    stmp2 = tmp1[1];
    field = 2;
    for (i=0; i<n; i+=2) {
      // 'field' denotes an index into 'tmp1' for stmp3
      if (field == 0)
        tmp1[0:sws-1] = b[i+2:i+sws+1];
      stmp3 = tmp1[field];
      for (j=0; j<n; j+=sws) {
        tmp2[0:sws-1] = a[i-1][j:j+sws-1] * stmp1 + stmp2;
        a[i+1][j:j+sws-1] = tmp2[0:sws-1] * stmp2 + stmp3;
        a[i][j:j+sws-1] = tmp2[0:sws-1];
      }
      stmp1 = stmp3;
      stmp2 = tmp1[field+1];
      field = (field+2) % sws;
    }

(f) After superword replacement on code in (e).

Figure 1. Example code.
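
To check that the transformation in Figure 1(f) preserves the result of Figure 1(a), the two loop nests can be emulated in Python. This is a sketch under stated assumptions, not the compiler's output: list slices stand in for superword registers, the array a is stored with one extra leading row so that the reference a[i-1] at i = 0 stays in bounds, b is padded by sws elements, and the helper `make_inputs` is a hypothetical test-data generator. The emulation assumes n is even and a multiple of sws, and that sws is even and at least 4, because Figure 1(f) starts with field = 2.

```python
def make_inputs(n, sws):
    """Small integer data; a has one extra leading row for a[i-1] at i = 0."""
    a = [[(r + c) % 5 for c in range(n)] for r in range(n + 1)]
    b = [(k % 3) + 1 for k in range(n + sws)]
    return a, b


def original(a, b, n):
    """Figure 1(a). Row i of the figure is stored at a[i + 1]."""
    a = [row[:] for row in a]
    for i in range(n):
        for j in range(n):
            a[i + 1][j] = a[i][j] * b[i] + b[i + 1]
    return a


def superword_replaced(a, b, n, sws):
    """Figure 1(f); list slices emulate superword loads and stores."""
    assert n % 2 == 0 and n % sws == 0 and sws % 2 == 0 and sws >= 4
    a = [row[:] for row in a]
    tmp1 = b[0:sws]                       # one superword load of b
    stmp1, stmp2 = tmp1[0], tmp1[1]
    field = 2
    for i in range(0, n, 2):
        if field == 0:
            tmp1 = b[i + 2:i + 2 + sws]   # next superword of b
        stmp3 = tmp1[field]
        for j in range(0, n, sws):
            tmp2 = [x * stmp1 + stmp2 for x in a[i][j:j + sws]]
            a[i + 2][j:j + sws] = [x * stmp2 + stmp3 for x in tmp2]
            a[i + 1][j:j + sws] = tmp2
        stmp1 = stmp3
        stmp2 = tmp1[field + 1]
        field = (field + 2) % sws
    return a
```

Calling `original(a, b, n)` and `superword_replaced(a, b, n, sws)` on the same inputs should give identical arrays; integer data is used so the comparison is exact.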

Table 1 shows how the three different optimization paths
affect the number of array accesses to memory in the final
code. The original code has $n^2$ reads and writes to array a
and $2n^2$ reads to array b. Exploiting superword-level
parallelism in loop j, as in Figure 1(b), reduces the number of
reads and writes to array a by a factor of sws, since each
load or store operates on sws contiguous data items; for
array b, there is no change since the array is indexed by i
rather than j. If instead the code is optimized for scalar
register reuse, as in Figure 1(d), we can reduce the number
of array reads of a by a factor of 2, and reads of b by a
factor of n, with the number of writes remaining the same.
By combining superword-level parallelism and superword-level
locality as in Figure 1(f), we see that the number of reads
and writes is further reduced by a factor of sws.
Figure 1(f) illustrates some of the challenges in
exploiting reuse in superwords. Analysis must identify not
just temporal, but also spatial reuse, for both individual
statements and groups of references. The compiler also
must generate the appropriate code to exploit this reuse; for
example, we select scalar fields of b from the superword,
since we are not parallelizing the i loop.
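The effect on array a can be checked by walking the loop structures of Figure 1 and counting accesses. The sketch below is only an illustration: the type `AccessCount` and the four function names are hypothetical, and n is assumed to be a multiple of both 2 and sws. It counts only loads and stores of a, not of b.

```c
#include <assert.h>

/* Loads and stores of array a issued by one version of the loop nest. */
typedef struct {
    long a_reads;
    long a_writes;
} AccessCount;

/* Figure 1(a): each (i, j) iteration reads a[i-1][j] and writes a[i][j]. */
AccessCount count_original(long n) {
    AccessCount c = {0, 0};
    for (long i = 0; i < n; i++)
        for (long j = 0; j < n; j++) {
            c.a_reads++;
            c.a_writes++;
        }
    return c;
}

/* Figure 1(b): one superword load/store covers sws contiguous elements. */
AccessCount count_slp(long n, long sws) {
    assert(sws > 0 && n % sws == 0);
    AccessCount c = {0, 0};
    for (long i = 0; i < n; i++)
        for (long j = 0; j < n; j += sws) {
            c.a_reads++;
            c.a_writes++;
        }
    return c;
}

/* Figure 1(d): with i unrolled by 2, a[i][j] stays in a scalar temporary,
   so only a[i-1][j] is read, while both rows are still written. */
AccessCount count_scalar_replacement(long n) {
    assert(n % 2 == 0);
    AccessCount c = {0, 0};
    for (long i = 0; i < n; i += 2)
        for (long j = 0; j < n; j++) {
            c.a_reads += 1;
            c.a_writes += 2;
        }
    return c;
}

/* Figure 1(f): the same reuse, but on superwords instead of scalars. */
AccessCount count_sll(long n, long sws) {
    assert(sws > 0 && n % 2 == 0 && n % sws == 0);
    AccessCount c = {0, 0};
    for (long i = 0; i < n; i += 2)
        for (long j = 0; j < n; j += sws) {
            c.a_reads += 1;
            c.a_writes += 2;
        }
    return c;
}
```

For n = 8 and sws = 4, the combined version issues 8 superword reads and 16 superword writes of a, a factor of sws fewer than the scalar-replaced version.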

The remainder of this paper describes how the compiler
automatically generates code such as is shown in
Figure 1(f), and the performance improvements that can be
obtained with this approach.

3 Superword-Level Locality Algorithm

The superword-level locality algorithm has four main
steps, as summarized in the next subsection. At the heart
of the algorithm is an approach for counting both memory
accesses and register requirements for storing reused data,
which is the subject of the subsequent subsection.

3.1 Steps of Algorithm

Step 1: Identifying Reuse. First, we identify array
variables and loops carrying temporal or spatial reuse. We
examine the dependence graph, looking for references that
have loop-carried consistent dependences (i.e., constant
dependence distances) or are loop invariant with respect to
one of the loops, and so have opportunities for data reuse
that can be exposed by unroll-and-jam.

Applying unroll-and-jam to a loop with a loop-variant
reference creates loop-independent dependences in the
unrolled loop body. In the example in Figure 1(a), there is a
true dependence between references A[i][j] and A[i-1][j]
with distance vector (1,0). After unroll-and-jam, a
loop-independent dependence is created between A[i][j] in the
first statement and A[i][j] in the second statement,
creating a reuse opportunity. Similarly, spatial and
group-temporal reuse can be exposed by unroll-and-jam when a
reference has a loop-carried dependence with the loop that
traverses the lowest array dimension. For loop-invariant
references, unroll-and-jam generates loop-independent
dependences between the copies of the reference in the unrolled
loop body.

Step 2: Determining unroll factors for candidate loops.
The algorithm next determines the unroll factors for each
candidate loop that carries reuse and for which
unroll-and-jam is legal, with the following goal.

   Optimization Goal: Find unroll factors
   $(X_1, X_2, \ldots, X_n)$ for loops 1 to n in an n-deep loop
   nest such that the number of memory accesses
   is minimized, subject to the constraint that the
   number of superword registers required does not
   exceed what is available.

The search algorithm uses the reuse information and the
number of registers available to prune the search space, as
follows. Loops that carry no reuse are not included in the
search. Next, we observe that for each unrolled loop l, the
amount of reuse of an array reference with reuse carried by
l increases with the unroll factor $X_l$. Therefore reuse is a
monotonic, non-decreasing function of the unroll factor for
each loop, given that the unroll factors of all other loops
are fixed. The algorithm uses this property to prune the
search space, avoiding searching for all possible unroll
factors for a given loop. It traverses the search space by
varying the unroll factor of one loop while keeping the
unroll factors of all other loops fixed. A binary search
within a dimension can further prune the search. Also, the
unroll factor of each loop, given that all other unroll
factors are fixed, is limited by the number of registers
available. Once the search finds an unroll factor for a
given loop that exceeds the register limit, it prunes all
larger unroll factors for that loop from the search space.
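The pruning along one dimension can be sketched as a binary search. This is not the authors' implementation: the callback type `RegisterCost`, the function `largest_feasible_unroll`, and the cost model `demo_cost` are hypothetical names introduced here, and the sketch assumes the register requirement is non-decreasing in the unroll factor while all other unroll factors are held fixed.

```c
#include <assert.h>

/* Hypothetical callback: superword registers needed when one loop is
   unrolled by x and every other unroll factor is held fixed.
   Assumed to be non-decreasing in x. */
typedef long (*RegisterCost)(long x);

/* Returns the largest x in [1, max_unroll] with cost(x) <= reg_limit,
   or 0 if even x = 1 does not fit. Because reuse does not decrease as x
   grows, the largest feasible factor is the best one in this dimension. */
long largest_feasible_unroll(RegisterCost cost, long max_unroll, long reg_limit) {
    assert(max_unroll >= 1);
    long lo = 1, hi = max_unroll, best = 0;
    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        if (cost(mid) <= reg_limit) {
            best = mid;        /* mid fits; try larger factors */
            lo = mid + 1;
        } else {
            hi = mid - 1;      /* mid and everything above it is pruned */
        }
    }
    return best;
}

/* Illustrative cost model only: two superwords per unrolled copy plus
   one shared superword. */
long demo_cost(long x) {
    return 2 * x + 1;
}
```

With 32 registers and the illustrative cost model, `largest_feasible_unroll(demo_cost, 64, 32)` settles on 15 after a handful of probes instead of testing all 64 candidates.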

To guide the search towards the above optimization goal,
we calculate the superword footprint, which represents the
number of superwords accessed by the unrolled iterations
of the loop nest, as a function of the unroll factor. The
superword footprint can be used both to count how many
registers are required to hold the accessed data, as well as how
many memory accesses remain in the loop nest. Assuming
that all variables are kept in registers when the superword
footprint fits in the superword register file, the number of
memory accesses associated with a set of references is
simply the superword footprint for the references multiplied by
the bounds of the loops in which they are nested after
unrolling. Our method for selecting unroll factors based on
required superword registers differs from related approaches
oriented towards scalar registers [5], accounting for not only
temporal but also spatial and group reuse. In the next
subsection, we describe in detail the calculation of the
superword footprint.
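As a small illustration of this product, the sketch below multiplies a footprint by the trip counts of the transformed loops. The function name `memory_accesses` and its parameters are hypothetical; `step[k]` is assumed to be the number of original iterations covered by one iteration of transformed loop k (the unroll factor, or sws for a loop converted to superword operations), and it is assumed to divide the original trip count evenly.

```c
#include <assert.h>

/* footprint: superwords accessed by one execution of the transformed body.
   trip[k]:   trip count of loop k in the original nest.
   step[k]:   original iterations covered per transformed iteration. */
long memory_accesses(long footprint, const long *trip, const long *step, int depth) {
    long iterations = 1;
    for (int k = 0; k < depth; k++) {
        assert(step[k] >= 1 && trip[k] % step[k] == 0);
        iterations *= trip[k] / step[k];   /* bound of loop k after the transformation */
    }
    return footprint * iterations;
}
```

For the loop nest of Figure 1(f) with n = 8 and sws = 4, the body touches three superwords of a (one read, two writes), and the call with `trip = {8, 8}` and `step = {2, 4}` gives 24 accesses.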

Step 4: Unroll-and-Jam and Superword Replacement.
Once the unroll factors are decided, the loop nest is
transformed and array references are replaced with accesses to
superword temporaries, as discussed in Section 4.

3.2 Computing the Superword Footprint

The algorithm for computing the superword footprint for
a loop nest first partitions the references in the loop into
groups of uniformly generated references [30], that is,
references to the same array such that, for each array
dimension, the array subscripts differ only by a constant term¹.
Then, for each group of references, it computes the
registers needed to keep the data accessed in the unrolled loop
body. Finally, the total number of registers is computed as
the sum of those of each group of uniformly generated
references. We first discuss how to compute the registers
required for a single reference as a function of the unroll
factors of each unrolled loop. Then we discuss how to compute
the register requirements for a group of uniformly generated
references. The registers required for such a group may be
fewer than the sum of the registers required for each
reference, if computed individually, since the same superword
may be accessed by two or more copies of the original
references when the loops are unrolled.

Our method determines the number of superword
registers required to hold the data accessed by the loop
references in the unrolled loops. However, extra registers may be
needed to, for example, align a superword operand which
is already kept in superword registers. That is, the
computation may require more registers than those needed for
storing the data. Therefore, we reserve some scratch
registers for manipulating data and compute the number of
registers needed just for storing the data accessed in the unrolled
loops.

To simplify the presentation, we assume a loop nest of
depth n where all array references have array subscripts
that are affine functions of a single index variable (SIV
subscripts)². We also assume that each p-dimensional array
referenced by the loop is defined as
$A[s_1][s_2] \ldots [s_p]$, where $s_d$ is the size of
dimension d, $1 \le d \le p$. Dimension p is the lowest
dimension of the array, i.e., the dimension in which
consecutive elements are in consecutive memory locations.
A reference v to array A is then of the form
$A[a_1 * l_1 + b_1][a_2 * l_2 + b_2] \ldots [a_p * l_p + b_p]$.
Similarly, the array subscripts of the uniformly generated
references $v_1, v_2, \ldots, v_m$ in dimension d are
$a_d * l_d + b_1$, $a_d * l_d + b_2$, ..., $a_d * l_d + b_m$,
respectively. Thus, a reference with SIV subscripts has each
array dimension associated with just a single loop index
variable in the nest. We also assume that the arrays are
aligned to a superword in memory and that the loops are
normalized.

¹ We assume that two or more references that access the same array but
are not uniformly generated access distinct data in memory, which results
in a conservative estimate of the number of registers.

² Our current implementation can handle affine SIV subscripts and
certain affine MIV subscripts.

Figure 2. Superword footprint of a single reference.

3.2.1 Superword Footprint of a Single Reference

For each reference v with array subscripts $a_d * l_d + b$,
where d is the array dimension and $l_d$ is the loop variable
appearing in subscript d, the number of registers required to
keep the data referenced by v when $l_d$ is unrolled by
$X_{l_d}$ is given by the superword footprint of v in $l_d$,
or $F_{l_d}(v)$. The superword footprint consists of the
superwords accessed by all copies of v resulting from
unrolling.

When dimension d is the lowest array dimension (d = p),
the superword footprint is given by Equation (1).
Equation (1a) corresponds to the footprint of a loop-invariant
reference. Equation (1b) corresponds to the footprint of a
reference with self-spatial reuse within a superword, as
illustrated in Figure 2(a), and (1c) holds when the reference
has no spatial reuse.

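\[
F_{l_d}(v) =
\begin{cases}
1 & \text{(a) if } a_d = 0 \\
\left\lceil \dfrac{a_d * X_{l_d}}{sws} \right\rceil & \text{(b) if } 0 < a_d < sws \\
X_{l_d} & \text{(c) if } a_d \ge sws
\end{cases}
\tag{1}
\]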
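The three cases of Equation (1) can be written as a small function. This is a sketch under stated assumptions rather than the authors' code: the name `footprint_lowest` is hypothetical, the coefficient is assumed non-negative, and case (b) is taken to be the ceiling of $a_d * X_{l_d} / sws$, which is the value Equation (5) yields for a group containing a single reference.

```c
#include <assert.h>

/* Superword footprint of one reference in the lowest array dimension.
   a:   coefficient of the loop index in the subscript (assumed >= 0)
   x:   unroll factor of that loop
   sws: superword size in elements */
long footprint_lowest(long a, long x, long sws) {
    assert(a >= 0 && x >= 1 && sws >= 1);
    if (a == 0)
        return 1;                        /* (1a) loop-invariant reference */
    if (a < sws)
        return (a * x + sws - 1) / sws;  /* (1b) assumed ceil(a * x / sws) */
    return x;                            /* (1c) no spatial reuse */
}
```

With sws = 4, a unit-stride reference unrolled by 4 needs one superword register, while a reference with stride 4 unrolled by 3 needs three.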

When d is one of the higher dimensions, $1 \le d < p$,
and loop $l_d$ is unrolled, the offset between the footprints
of each copy of v is $a_d * \prod_{i=d+1}^{p} s_i$, where
$s_i$ is the size of the array dimension, as shown in
Figure 2(b). Assuming that the size of the lowest array
dimension ($s_p$) is larger than sws, which is usually the
case in practice for realistic array dimensions, each copy
of v in the unrolled loop body corresponds to a separate
footprint, as shown in Figure 2(b). Therefore the size of
the footprint of v in $l_d$ is the sum of the $X_{l_d}$
disjoint footprints, and is recursively defined by
Equation (2), where $F_{l_p}(v)$ is computed as in Equation (1).

\[
F_{l_d}(v) = X_{l_d} * F_{l_{d+1}}(v)
           = \left( \prod_{i=d}^{p-1} X_{l_i} \right) * F_{l_p}(v)
\tag{2}
\]
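The recursion in Equation (2) unwinds into a product of unroll factors. The sketch below is illustrative only: the function name `footprint_from_dim` is hypothetical, dimensions are numbered from 0 to p-1 (so p-1 is the lowest dimension), and the footprint in the lowest dimension is passed in rather than recomputed.

```c
#include <assert.h>

/* x[i]:      unroll factor of the loop indexing dimension i, 0 <= i < p
   d:         dimension at which the footprint is requested
   f_lowest:  footprint of the reference in the lowest dimension (p-1)
   Returns x[d] * x[d+1] * ... * x[p-2] * f_lowest. */
long footprint_from_dim(const long *x, int d, int p, long f_lowest) {
    assert(p >= 1 && d >= 0 && d < p && f_lowest >= 1);
    long f = f_lowest;
    for (int i = p - 2; i >= d; i--) {
        assert(x[i] >= 1);
        f *= x[i];   /* each copy in a higher dimension is a disjoint footprint */
    }
    return f;
}
```

For a three-dimensional reference with unroll factors (2, 3, 4) and a lowest-dimension footprint of one superword, the footprint at the outermost dimension is 2 * 3 * 1 = 6 superwords.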

For a single reference, the number of superword registers
given by Equation (1) and the number of scalar registers that
would be required if the same unroll factors were used differ
only when $a_d < sws$, that is, when spatial reuse can be
exploited in superword registers. For a group of uniformly
generated references the analysis must also consider group
reuse, as discussed next.

3.2.2 Superword Footprint of a Reference Group

The number of registers required to keep a group of
uniformly generated references $V = \{v_1, v_2, \ldots, v_m\}$
when loop $l_d$ is unrolled by $X_{l_d}$ is the superword
footprint of the group, $F_{l_d}(V)$. The superword footprint
of a group consists of the union of the footprints of the
individual references, as some of the reference footprints
may overlap, depending on the distance between the constant
terms in the array subscripts.

The footprints of two uniformly generated references
may overlap in dimension d only if they overlap in all
dimensions higher than d. For example, the footprints of
references A[2i][j+2] and A[2i+1][j] do not overlap in the
highest (row) dimension, since the first reference accesses the
even-numbered rows of the array and the second accesses
the odd-numbered rows. Therefore the footprints cannot
overlap in the lowest (column) dimension. On the other
hand, the footprints of A[2i][j+2] and A[2i+4][j] overlap
in the row dimension for iterations $i_1$, $i_2$,
$1 \le i_1, i_2 \le X_i$, such that $2 i_1 = 2 i_2 + 4$. For
the iterations of i in which the footprints overlap in the
row dimension, the footprints may overlap in the column
dimension if there exist iterations
$1 \le j_1, j_2 \le X_j$ such that $j_1 + 2 = j_2$.
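The overlap test in this example amounts to asking whether two affine subscripts can take the same value for some pair of unrolled copies. A brute-force sketch follows; the function name `subscripts_can_overlap` is hypothetical, and the search simply enumerates copies 1 through x of each reference.

```c
#include <assert.h>
#include <stdbool.h>

/* True if some copies i1, i2 in [1, x] make the subscripts
   a * i1 + b1 and a * i2 + b2 equal in one array dimension. */
bool subscripts_can_overlap(long a, long b1, long b2, long x) {
    assert(x >= 1);
    for (long i1 = 1; i1 <= x; i1++)
        for (long i2 = 1; i2 <= x; i2++)
            if (a * i1 + b1 == a * i2 + b2)
                return true;
    return false;
}
```

The rows 2i and 2i+1 never coincide, so `subscripts_can_overlap(2, 0, 1, x)` is false for every x. The rows 2i and 2i+4 coincide once the unroll factor reaches 3, since copy 3 of the first reference and copy 1 of the second both touch row 6.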

The superword footprint of a group V in a set of
unrolled loops is computed as follows. For each dimension
d, from highest to lowest dimension, the footprint is
computed assuming that the footprints of the references in the
group overlap in the higher dimensions. For each
dimension d < p, the algorithm partitions references into subsets
such that each subset corresponds to a disjoint footprint in
dimension d. Then, for each subset, the algorithm
recursively computes the footprint in dimension d + 1, as we
now describe.

Dimension d is the lowest dimension (d = p). We first
compute the group footprint of two array references, and
then we extend it for m references. The group footprint of
two references $v_1$ and $v_2$, with lowest dimension
subscripts $a_d * l_d + b_1$ and $a_d * l_d + b_2$ such that
$b_1 \le b_2$, when loop $l_d$ is unrolled by $X_{l_d}$ is
given by Equation (3) in Figure 3.

Equations (3a), (3b) and (3c) correspond to
combinations of two basic conditions which determine the
superword footprint of a pair of uniformly generated references.
The first condition is whether the references have
self-spatial reuse within a superword, that is, whether
$a_d < sws$. The second is whether the footprints may overlap,
which is the case when $(b_2 - b_1) < a_d * X_{l_d}$.

Figure 3 shows four examples of superword footprints
corresponding to Equation (3). Figure 3(a) corresponds to
Equation (3a), where the footprints may overlap and the
group footprint is the union of the two footprints. Each of
the individual footprints is a set of $X_{l_d}$ superwords
since the references have no spatial reuse. The footprints
overlap if $(b_2 - b_1)$ is evenly divided by $a_d$ and there
exists an integer value k, $1 \le k \le X_{l_d}$, such that
$k = 1 + (b_2 - b_1)/a_d$. This equation precisely computes
the overlapped footprint when the two footprints have group
temporal reuse. For group spatial reuse, we conservatively
approximate the footprint with Equation (3c). In Figure 3(b)
the footprints of $v_1$ and $v_2$ overlap, and both references
have spatial reuse within a superword. The corresponding
footprint size is given by Equation (3b).

Figures 3(c) and 3(d) correspond to Equation (3c), where
the footprints do not overlap and therefore the group
footprint is the sum of the individual footprints. In Figure 3(c)
$v_1$ has no self-spatial reuse and each copy of $v_1$ in the
unrolled loop body accesses a distinct superword, and the
same is true for $v_2$. In Figure 3(d) both $v_1$ and $v_2$
have superword spatial reuse.

The number of registers required for reference group
$V = \{v_1, v_2, \ldots, v_m\}$ is computed by extending the
equations above to more than two references. Here we describe
the most interesting case (corresponding to Equation (3b)),
where the footprints overlap and the references have
spatial reuse. A subset group
$V_i = \{v_{i_{min}}, \ldots, v_{i_{max}}\}$ is defined by
lowest dimension subscripts $a_p * l_p + b_j$,
$i_{min} \le j \le i_{max}$, where the references have been
sorted so that $b_{j-1} \le b_j$. $V_i$ has a footprint
consisting of contiguous superwords if there is self-spatial
reuse ($a_p < sws$) and possible overlap
($b_j - b_{j-1} < a_p * X_{l_p}$) for all j such that
$i_{min} < j \le i_{max}$. To compute the number of registers
required for the entire group, the algorithm partitions V
into disjoint subsets $V_i$ as defined above, where

\[
V_i = \{\, v_j \mid i_{min} \le j \le i_{max} \,\}
\text{ such that }
\begin{aligned}
 & (b_j - b_{j-1} < a_p * X_{l_p}) \;\wedge \\
 & (i_{min} = 1 \;\vee\; b_{i_{min}} - b_{i_{min}-1} \ge a_p * X_{l_p}) \;\wedge \\
 & (i_{max} = m \;\vee\; b_{i_{max}+1} - b_{i_{max}} \ge a_p * X_{l_p})
\end{aligned}
\tag{4}
\]

Each subset $V_i$ corresponds to a footprint of contiguous
superwords consisting of the union of the individual
footprints, with size given by Equation (5).

\[
F_{l_p}(V_i) = F_{l_p}(\{v_{i_{min}}, \ldots, v_{i_{max}}\})
             = \left\lceil \frac{a_p * X_{l_p} + b_{i_{max}} - b_{i_{min}}}{sws} \right\rceil
\tag{5}
\]


The total number of superword registers required for the
references in V is then the sum of the disjoint footprints of
the sets $V_i$, as in (6).

\[
F_{l_p}(V) = \sum_{i} \left\lceil \frac{a_p * X_{l_p} + b_{i_{max}} - b_{i_{min}}}{sws} \right\rceil
\tag{6}
\]
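The partition-and-sum procedure for the lowest dimension can be sketched in a single pass over the sorted constant terms. This is an illustration under stated assumptions, not the authors' code: the function name `group_footprint_lowest` is hypothetical, the constant terms are assumed already sorted in ascending order, and the references are assumed to have self-spatial reuse (0 < a < sws). A subset is closed whenever the gap to the next constant term is at least a * x, and each subset contributes the ceiling of (a * x + b_max - b_min) / sws.

```c
#include <assert.h>

/* b[0..m-1]: constant terms of the group's lowest-dimension subscripts,
              sorted in ascending order
   a:         common coefficient, assumed 0 < a < sws
   x:         unroll factor of the loop indexing the lowest dimension
   sws:       superword size in elements */
long group_footprint_lowest(const long *b, int m, long a, long x, long sws) {
    assert(m >= 1 && a > 0 && a < sws && x >= 1);
    long span = a * x;     /* elements covered by the copies of one reference */
    long total = 0;
    int start = 0;         /* index of the first reference in the current subset */
    for (int j = 1; j <= m; j++) {
        if (j < m)
            assert(b[j] >= b[j - 1]);
        /* Close the subset at the end of the list or at a gap >= span. */
        if (j == m || b[j] - b[j - 1] >= span) {
            long width = span + b[j - 1] - b[start];
            total += (width + sws - 1) / sws;   /* ceiling division */
            start = j;
        }
    }
    return total;
}
```

With sws = 4, a = 1 and x = 4, the constant terms {0, 1, 10} split into the subsets {0, 1} and {10}, which need 2 and 1 superword registers respectively.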


Dimension d is not the lowest dimension ($d \ne p$). When
d is one of the higher dimensions, the superword footprint
of $V = \{v_1, v_2, \ldots, v_m\}$ in loop $l_d$ is again the
union of the individual footprints.

From Section 3.2.1, the footprint of each reference $v_j$ in
the unrolled loop body consists of a set of $X_{l_d}$ disjoint
footprints, where each of the $X_{l_d}$ footprints starts at
superword $(a_d * l_d + b_j) * \prod_{i=d+1}^{p} s_i$, where
$s_i$ is the size of dimension i, and $1 \le l_d \le X_{l_d}$.

Therefore the footprints of different references in the
group may overlap for some superwords, depending on the
values of $a_d$, $b_j$ and the unroll factor $X_{l_d}$. The
footprints of two uniformly generated references $v_1$ and
$v_2$ overlap in dimension d if there exists an integer value
k such that $1 \le k \le X_{l_d}$ that satisfies Condition (7).

\[
a_d * k + b_1 = a_d + b_2
\tag{7}
\]


Furthermore, if there exists k satisfying the above condition,
the footprints corresponding to copies k to $X_{l_d}$ of $v_1$
in the unrolled loop body overlap with those corresponding to
the first $X_{l_d} - k + 1$ copies of $v_2$. The footprint of
$\{v_1, v_2\}$ is then given by Equation (8).

\[
\begin{aligned}
F_{l_d}(v_1, v_2) ={} & (k - 1) * F_{l_{d+1}}(v_1) \\
                    & + (X_{l_d} - k + 1) * F_{l_{d+1}}(v_1, v_2) \\
                    & + (k - 1) * F_{l_{d+1}}(v_2)
\end{aligned}
\tag{8}
\]
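The overlap condition and the resulting pair footprint can be sketched together. The names `overlap_start` and `pair_footprint_higher` are hypothetical. The sketch assumes a > 0 and b1 <= b2, and it assumes that when no valid k exists the two references contribute x disjoint footprints each; the footprints in dimension d+1 are passed in as already-computed values.

```c
#include <assert.h>

/* Returns the k in [1, x] with a * k + b1 == a + b2, or 0 if there is none.
   Assumes a > 0 and b1 <= b2. */
long overlap_start(long a, long b1, long b2, long x) {
    assert(a > 0 && x >= 1 && b1 <= b2);
    long diff = b2 - b1;
    if (diff % a != 0)
        return 0;                /* subscripts never coincide */
    long k = 1 + diff / a;
    return (k <= x) ? k : 0;     /* the overlap must start within the unrolled copies */
}

/* f1, f2: footprints of v1 and v2 alone in dimension d+1
   f12:    footprint of the pair {v1, v2} in dimension d+1 */
long pair_footprint_higher(long a, long b1, long b2, long x,
                           long f1, long f2, long f12) {
    long k = overlap_start(a, b1, b2, x);
    if (k == 0)
        return x * (f1 + f2);    /* assumed: all copies are disjoint */
    return (k - 1) * f1          /* leading copies of v1 with no partner */
         + (x - k + 1) * f12     /* overlapping copies, counted once as a pair */
         + (k - 1) * f2;         /* trailing copies of v2 with no partner */
}
```

For the row subscripts 2i and 2i+4 with unroll factor 4, the overlap starts at k = 3: two copies of each reference stand alone and two copies are shared.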

To compute the size of the entire footprint of V in $l_d$, our
algorithm partitions V into subsets
$V_i = \{v_{i_{min}}, \ldots, v_{i_{max}}\}$ such that, for any
j, $i_{min} \le j < i_{max}$, the pair $\{v_j, v_{j+1}\}$
satisfies Condition (7). The footprint of $V_i$ is the union of
the overlapped footprints of its reference set and is
computed by extending Equation (8) to more than two
references.

4 Optimizations for Superword Replacement

After the appropriate unroll factors are determined by the
algorithm in the previous section, the unrolled code is then
optimized for superword-level parallelism. Not until after
SLP are the final code transformations performed to
actually exploit reuse in superword registers. In this section,
we briefly describe these transformations.

4.1 Replacing Redundant Loads and Stores

Our compiler replaces redundant loads and stores
from/to memory with accesses to superword temporaries.
Since the code is already unrolled, it is very straightforward
to recognize these opportunities. The compiler simply
determines that addresses and offsets for different memory
accesses fit within the same superword, and verifies that there
are no intervening kills to the memory locations.

4.2 Packing in Superword Registers

As part of SLP's code generation, whenever data is
packed to form superwords, this is done through memory.
A data element is loaded into a scalar register from the
source location and stored to the destination location.
Packing through memory is in some sense motivated by the fact
that many multimedia extension architectures do not
support register-to-register transfers between scalar and
superword register files.

In our system, we have developed an optimization we
call register packing, shown in Figure 4, to perform this
packing in the superword register file. We take advantage
of two instructions that are common in multimedia
extension architectures, which we call replicate and
shift-and-load. Replicate replicates one element of a source
register to all elements of a destination register.
Shift-and-load takes two source registers. The first source
register is shifted left by the amount of the third argument
and the same amount is taken from the second source register
to fill the destination register. Packing these operands in
superword registers eliminates numerous scalar loads and
stores.

w = *((float *)&a + 0);
x = *((float *)&b + 0);
y = *((float *)&c + 0);
z = *((float *)&d + 0);
*((float *)&p + 0) = w;
*((float *)&p + 1) = x;
*((float *)&p + 2) = y;
*((float *)&p + 3) = z;

(a) Packing through memory

temp1 = replicate(a, 0);
temp2 = replicate(b, 0);
temp3 = replicate(c, 0);
temp4 = replicate(d, 0);
p = shift_and_load(temp1, temp1, 4);
p = shift_and_load(p, temp2, 4);
p = shift_and_load(p, temp3, 4);
p = shift_and_load(p, temp4, 4);

(b) Packing in registers


Figure 4. Register Packing
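The register-packing sequence can be simulated in plain C to confirm that it assembles the four scalars in order. This is a software model only: the `Superword` type and the functions `replicate`, `shift_and_load` and `pack4` are hypothetical stand-ins for the instructions described above, and the shift amount here is counted in elements rather than bytes.

```c
#include <assert.h>

#define SWS 4   /* elements per superword, assuming 32-bit floats in 128 bits */

typedef struct {
    float e[SWS];
} Superword;

/* Copies element idx of src into every element of the result. */
Superword replicate(Superword src, int idx) {
    assert(idx >= 0 && idx < SWS);
    Superword r;
    for (int k = 0; k < SWS; k++)
        r.e[k] = src.e[idx];
    return r;
}

/* Shifts first left by n elements and fills the vacated positions with
   the leading n elements of second. */
Superword shift_and_load(Superword first, Superword second, int n) {
    assert(n >= 0 && n <= SWS);
    Superword r;
    for (int k = 0; k < SWS - n; k++)
        r.e[k] = first.e[k + n];
    for (int k = 0; k < n; k++)
        r.e[SWS - n + k] = second.e[k];
    return r;
}

/* Packs element 0 of a, b, c and d into one superword without going
   through memory, following the sequence of Figure 4(b). */
Superword pack4(Superword a, Superword b, Superword c, Superword d) {
    Superword t1 = replicate(a, 0);
    Superword t2 = replicate(b, 0);
    Superword t3 = replicate(c, 0);
    Superword t4 = replicate(d, 0);
    Superword p = shift_and_load(t1, t1, 1);
    p = shift_and_load(p, t2, 1);
    p = shift_and_load(p, t3, 1);
    p = shift_and_load(p, t4, 1);
    return p;
}
```

Each shift pushes the earlier values one position toward element 0, so after the fourth step the result holds the four scalars in the order a, b, c, d.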


4.3 Shifting for Partial Reuse

Spatial reuse within a superword happens when distinct
loop iterations access different data in the same superword.
Partial spatial reuse of superwords occurs when distinct
loop iterations access data in consecutive superwords in
memory, partially reusing the data in one or both
superwords, as shown by the example in Figure 5, and illustrated
graphically in Figure 5(d). In this example, as before
assuming that sws = 4, array reference b[i + j] has partial
spatial reuse in loop i. For a fixed value of i and j, the data
accessed in iteration (i, j) consists of the last three words
of the superword accessed in iteration (i - 1, j), plus the
first word of the next superword in memory. This type of
reuse can be exploited by shifting the first word out of the
superword, and shifting in the next word, as in Figure 5.
As shown in Figure 5(c), only two superwords need to be
loaded for the data accessed in the 4 copies of b[i + j] in the
loop body, after shifting is applied. Before shifting, b[i + j]
had to be loaded from memory (and aligned, for
architectures that support only aligned accesses) for each of the
four copies of b[i + j] in the loop body.
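A software model makes the saving in loads concrete. The sketch below is an illustration with hypothetical names (`load_superword`, `shift_and_load`, `operands_with_shifting`, and the counter `superword_loads`). As a simplification, it derives each operand directly from the two loaded superwords with a shift of u elements, rather than shifting cumulatively by one element per step.

```c
#include <assert.h>

#define SWS 4   /* elements per superword */

/* Number of superword loads issued so far. */
int superword_loads = 0;

/* Models one superword load from memory. */
void load_superword(const float *src, float dst[SWS]) {
    for (int k = 0; k < SWS; k++)
        dst[k] = src[k];
    superword_loads++;
}

/* out = first[n .. SWS-1] followed by second[0 .. n-1]. */
void shift_and_load(const float first[SWS], const float second[SWS],
                    int n, float out[SWS]) {
    assert(n >= 0 && n <= SWS);
    for (int k = 0; k < SWS - n; k++)
        out[k] = first[k + n];
    for (int k = 0; k < n; k++)
        out[SWS - n + k] = second[k];
}

/* Builds the operands b[base+u .. base+u+3] for u = 0..3, the four copies
   of b[i + j] in the unrolled body, from only two superword loads. */
void operands_with_shifting(const float *b, int base, float ops[SWS][SWS]) {
    float tmp1[SWS], tmp2[SWS];
    load_superword(b + base, tmp1);
    load_superword(b + base + SWS, tmp2);
    for (int u = 0; u < SWS; u++)
        shift_and_load(tmp1, tmp2, u, ops[u]);
}
```

Without shifting, each of the four operands would need its own (possibly unaligned) load; here the counter stops at two.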

Detecting the applicability of superword shifting is
straightforward, involving checking the dependence
distance on the loop for small, constant distances. Code
generation is also straightforward, since multimedia extension
architectures support efficient shifting and permutation
mechanisms for aligning and rearranging data in superwords.

5 Experimental Results

This section presents an experiment that demonstrates
the dramatic performance improvements that can be derived
from compiler-controlled caching in superword registers.
We describe an implementation that incorporates superword
register locality optimizations into an existing compiler
exploiting superword-level parallelism [20]. We present a set
of results on four multimedia kernels and two scientific
applications, derived automatically from our implementation.

5.1 Implementation and Methodology

Figure 6 illustrates the system we have developed for this
experiment, which uses the Stanford SUIF compiler as its
underlying infrastructure [18]. The input to the system is a
C program, which is then optimized by passes in SUIF,
including our Superword Locality analysis described in
Section 3, followed by the Superword-Level Parallelism (SLP)
optimization passes by Larsen and Amarasinghe [20], and
finally, an optimization pass that performs superword
replacement as described in Section 4 to steer the compiler to
obtain the reuse in superword registers that the SLL
algorithm determined was possible.

for (i = 0; i < n; i++)
   for (j = 0; j < n; j++)
      a[i][j] = b[i+j] * c[j];

(a) Original loop nest

for (i = 0; i < n; i += 4)
   for (j = 0; j < n; j += 4) {
      a[i][j:j+3]   = b[i+j:i+j+3]   * c[j:j+3];
      a[i+1][j:j+3] = b[i+j+1:i+j+4] * c[j:j+3];
      a[i+2][j:j+3] = b[i+j+2:i+j+5] * c[j:j+3];
      a[i+3][j:j+3] = b[i+j+3:i+j+6] * c[j:j+3];
   }

(b) After unroll-and-jam and SLP (assuming sws = 4).

for (i = 0; i < n; i += 4)
   for (j = 0; j < n; j += 4) {
      tmp1[0:3] = b[i+j:i+j+3];
      tmp2[0:3] = b[i+j+4:i+j+7];
      a[i][j:j+3] = tmp1[0:3] * c[j:j+3];
      shift_and_load(tmp1[0:3], tmp2[0:3], 1);
      a[i+1][j:j+3] = tmp1[0:3] * c[j:j+3];
      shift_and_load(tmp1[0:3], tmp2[0:3], 1);
      a[i+2][j:j+3] = tmp1[0:3] * c[j:j+3];
      shift_and_load(tmp1[0:3], tmp2[0:3], 1);
      a[i+3][j:j+3] = tmp1[0:3] * c[j:j+3];
   }

(c) After shifting across superword registers.

(d) Graphical depiction of shifting.

Figure 5. Shifting registers for partial reuse.


Figure 6. Implementation.


Table 2. Benchmark programs.

The output from the SUIF portion of the system is an
optimized C program, augmented with special superword data
types and operations. Currently, the resulting code is passed
to a GNU C backend, modified to support superword data
types and operations for the PowerPC AltiVec
instruction-set architecture extensions. Each superword operation
corresponds, in most cases, to a single instruction in the
AltiVec ISA. The role of the GCC backend includes replacing
the vector operations with the corresponding AltiVec
superword instruction, and allocating the vector data types to the
superword registers. The resulting code is executed on a
533 MHz Macintosh PowerPC G4, which has a superword
register file consisting of 32 128-bit registers.

5.2 Performance Measurements

We have applied the previously-described
implementation to four of the five multimedia kernels and the two
scientific programs from the SPECfp95 benchmark suite for
which execution time speedups were reported in Larsen
and Amarasinghe [20], summarized in Table 2. As a first
step, we verified that we could reproduce their previously
reported results. For purposes of comparison, we initially
followed the same methodology established in Larsen and
Amarasinghe [20]: (1) we used the same programs; (2) all
versions of the code were compiled on the AltiVec without
optimization; and (3) baseline measurements were derived
by compiling the unparallelized code for the PowerPC G4.
We are using an updated implementation of SLP from what
was published, as well as a faster target machine and new
releases of GCC and the Linux operating system, so there
are some differences in results, but they are very minor.

Larsen and Amarasinghe were unable to use
optimization on the AltiVec-extended GCC backend at the time
of their study, but in the intervening time, this
Motorola-supplied backend has become more robust. For the results
presented in this section, we modify the methodology to
perform "-O3" optimizations. To understand the overall
benefits of exploiting compiler-controlled caching in
superword registers, we have compared the results of the full
system with those obtained when SLP is used alone. For this
reason, we report results where SLP is applied to the
original codes and compare these results to the full system.


Figure 7. Reduction in dynamic memory
accesses due to superword replacement.

We show two sets of results. First, in Figure 7(a), we
show the percentage of vector loads and stores eliminated
by the full system, as compared with SLP alone. Our
approach eliminates over 50% of the vector loads and stores
in three of the four kernels, and over 85% in SWIM and
TOMCATV. We also eliminate scalar loads and stores using
register packing, as described in Section 4. In Figure 7(b),
we see that our approach eliminates over 90% of the scalar
loads and stores in the four kernels, and over 35% in SWIM
and TOMCATV.

Figure 8 shows how these reductions in instructions
translate into speedups over SLP. To isolate the benefits of
individual components of our system, we measure the
performance of the code at several stages of the optimization
process. The first bar, normalized to 1, shows the results
of SLP alone. The second bar, called Unrolled+SLP, shows
the results of running the first portion of the SLL algorithm,
described in Section 3, which performs unroll-and-jam on
the loop nest to expose opportunities for superword reuse,
and following up with SLP. This bar isolates the impact of
unrolling, since it is not until after SLP that this reuse is
actually exploited. Also, because it is reordering the
iteration space to bring reuse closer together, this version
will also obtain locality benefits in the data cache. Thus,
this bar provides the cache locality benefits of
unroll-and-jam, which can be compared against the additional
improvements from superword register locality. The third bar,
Superword Replacement, provides speedup using Superword
Replacement and Shifting, as described in Section 4. The
final bar, entitled Register Packing, shows the additional
improvement due to this technique, also described in
Section 4.

Overall, we see that in combination, applications achieve speedups between 1.3 and 2.8 over SLP alone, with an average of 2.2X. Consideration of TOMCATV and SWIM shows that both programs have little temporal reuse, although there is a small amount of spatial reuse that is exploited with our approach, particularly in TOMCATV. We are obtaining a locality benefit due to unroll-and-jam. We also observe additional SLP due to iteration-space splitting, motivated by the need to create a steady-state loop where the data is aligned to a superword boundary. The four other programs show a significant improvement from superword replacement. For VMM, MMM and FIR, there are also huge gains due to register packing.

Figure 8. Speedups over SLP alone.

In summary, the SLL techniques presented in this paper dramatically reduce the number of memory accesses and yield significant performance improvements across these 6 programs. Thus, this paper has demonstrated the value of exploiting locality in superword registers in architectures that support superword-level parallelism such as the AltiVec.

6  Related  Research 

For well over a decade, a significant body of research has been devoted to code transformations to improve cache locality, most of it targeting loop nests with regular data access patterns [13, 6, 31, 32]. Loop optimizations for improving data locality, such as tiling, interchanging and skewing, focus on reducing cache capacity misses. Of particular relevance to this paper are approaches to tiling for cache to exploit temporal and spatial reuse; the bulk of this work examines how to select tile sizes that eliminate both capacity misses and conflict misses, tuned to the problem and cache sizes [7, 9, 12, 14, 15, 16, 19, 28, 30, 26]. The key difference between our work and that of tiling for caches is that interference is not an issue in registers. Therefore, models that consider conflict misses are not appropriate. Further, our code generation strategy must explicitly manage reuse in registers.

There has been much less attention paid to tiling and other code transformations to exploit reuse in registers, where conflict misses do not occur, but registers must be explicitly named and managed. A few approaches examine mapping array variables to scalar registers [30, 5, 23]. Most closely related to ours is the work by Carr and Kennedy, which uses scalar replacement and unroll-and-jam to exploit scalar register reuse [4]. Like our approach, in deriving the unroll factors, they use a model to count the number of registers required for a potential unrolling to avoid register pressure, and they replace array accesses, which would result in memory accesses, with accesses to temporaries that will be put in registers by the backend compiler. Their search for an unroll factor is constrained by register pressure and another metric called balance that matches memory access time to floating point computation time. Our approach is distinguished from all these others in that the model for register requirements must take spatial locality into account, we replace array accesses with superwords rather than scalars, and we also consider the optimizations in light of superword parallelism.

There are several recent compilation systems developed for superword-level parallelism [20, 27, 8, 10, 1]. Most, including commercial compilers [29, 24], are based on vectorization technology [27, 10]. In contrast, Larsen and Amarasinghe devised a superword-level parallelization system for multimedia extensions [20]. They point out that there are many differences between the multimedia extension architectures and vector architectures, such as short vectors, ease of mixing with scalar instructions, and need for alignment of memory accesses [21]. They argue that their algorithm for finding superword-level parallelism from a basic block instead of a loop nest is much more effective than using vectorization-based techniques. None of the above approaches exploit reuse in the superword register file.

7  Conclusion 

This paper presents an algorithm for compiler-controlled caching in superword register files. The algorithm is applicable to multimedia extensions such as Intel's SSE and PowerPC's AltiVec, and also to Processor-in-memory (PIM) architectures with support for superword operations.

We implemented our approach in an existing compiler targeting superword-level parallelism. We presented experimental results, derived automatically, comparing the performance of six benchmarks/multimedia kernels optimized for parallelism only, using SLP, and optimized for both parallelism and locality. Our results show speedups ranging from 1.3 to 2.8X, with an average of 2.2X, on the 6 programs as compared to using SLP alone, and most memory accesses are removed.

The approach taken here, which separates optimizations for SLL and SLP, is convenient for implementation purposes, since we are building upon the work of others. Further, as there are now a few other compilers that exploit superword-level parallelism [27, 8, 10, 1], the same approach can be used to extend these existing systems to incorporate compiler-controlled caching in superword registers. Ideally, however, an optimizer that integrates the superword parallelism and locality techniques could be even more effective. For example, in a combined algorithm, selection of which loops to parallelize could also take superword-level locality into account. A combined algorithm is the subject of future work.
ABSTRACT

This paper presents a compiler algorithm and several optimization techniques to exploit a DRAM memory characteristic (page mode) automatically. A page-mode memory access exploits a form of spatial locality, where the data item is in the same row of the memory buffer as the previous access. Thus, access time is reduced because the cost of row selection is eliminated. The algorithm increases the frequency of page-mode accesses by reordering data accesses, grouping together accesses to the same memory row. We implemented this algorithm and present speedup results for four multimedia kernels ranging from 1.25 to 2.19 for a Processing-In-Memory (PIM) embedded DRAM device.

1. INTRODUCTION

Memory delays are a major performance bottleneck in embedded-DRAM systems, where the memory latencies seen by the processor are dominated by the on-chip DRAM access time. DRAM modules support an efficient page-mode access, where a memory access to a location currently in the DRAM open-row buffer fetches the data directly from that buffer, eliminating the cost of fetching the row from the DRAM array. Page-mode accesses, when applicable, are supported by the DRAM's memory controller. To fully exploit lower-latency page-mode accesses, the user or the compiler must reorganize the computation so that accesses to the same memory row are grouped together, and there are no intervening accesses to other rows.

In the past decade, most of the research on compiler optimizations for the memory hierarchy focused on exploiting data locality in caches [4, 7, 8, 9, 10, 15, 24, 25]. Although cache optimizations and page-mode optimizations have the common goal of exploiting data reuse (in caches or in the DRAM's open row, respectively), the analysis and code transformations required are different. For example, loop tiling is used to exploit temporal reuse in caches by bringing together in time loop iterations that access the same data. The goal is to keep the data accessed in a tile in cache, and the order of the accesses within a tile is not important. On the other hand, exploiting page-mode accesses requires not only bringing together in time loop iterations that access data in the same memory row, but also grouping these data accesses together. Exposing opportunities for grouping accesses to the same array may require transformations such as unroll-and-jam, to bring accesses issued in distinct loop iterations to the body of the transformed loop, and statement reordering, to group the memory accesses.

Recent research has proposed to exploit page-mode accesses through manual code transformations [19, 17, 3]. This paper presents a compiler algorithm for exploiting page mode automatically. Our algorithm is implemented in the SUIF compiler infrastructure [13], and it leverages well-known compiler analyses and code transformations to identify potential page-mode accesses and group these memory accesses together. The algorithm is applicable to loop-based computations in general embedded systems, and it is also applicable to embedded-DRAM systems designed to exploit the large on-chip bandwidths by transferring and processing objects larger than a machine word [23, 1].

We have performed an experimental evaluation of our algorithm on a Processing-In-Memory (PIM) device that is part of the DIVA architecture [12], where the PIM processor is capable of transferring and processing 256-bit objects (superwords) in parallel. Our results show the performance improvements from exploiting page-mode accesses, and the combined benefits of page-mode accesses and other compiler optimizations targeting architectures with support for superword-level parallelism¹ (SLP) [16, 22]. We obtain speedups ranging from 1.25 to 2.19 for four multimedia kernels. This paper makes the following contributions:

• A new compiler algorithm for automatically exploiting page-mode memory accesses;

• An experimental evaluation of the algorithm on four data-intensive multimedia kernels;

• A discussion of practical issues that must be addressed when exploiting page-mode accesses in combination with other compiler optimizations.

This paper is organized as follows. Section 2 motivates our approach using a simple example. Section 3 introduces our algorithm for exploiting page-mode memory accesses. Section 4 presents experimental results on a set of four multimedia kernels. Section 5 addresses practical issues which are the subject of future work. Related research is discussed in Section 6 and Section 7 concludes the paper.

2. MOTIVATION

Figure 1 illustrates the benefits of page-mode accesses using a simple loop nest with two array references. Assuming that the sizes of arrays A and B are larger than the DRAM's open-row buffer, all array references in Figure 1(a) are in random mode, since reference B[i] displaces the DRAM row containing A[j][i] from the open-row buffer and vice versa.

¹ Fine-grain SIMD parallelism in a register larger than a machine word.

Table 1: Memory Latency Computation

For the same number of memory accesses in this loop nest, we can increase the number of page-mode memory accesses by applying a series of code transformations, as shown in Figure 1(b). First, unroll-and-jam is used to unroll the outer i loop and fuse together the resulting inner j loop bodies. Unroll-and-jam creates opportunities for page-mode accesses by moving array references from successive iterations of the outer loop into the body of the transformed inner loop. Once the loop is unrolled and the copies of the loop body are fused, accesses to the same memory page in the loop body may be grouped together by reordering the memory accesses in the transformed loop body, if the reordering does not violate data dependences.

In Figure 1(b), following unroll-and-jam, where the i loop is unrolled by a factor of 4, references to the same array (A or B) in the body of the transformed loop are grouped together. This results in page-mode accesses for all references in the loop body, except leading references A[j][i] and B[i], which are in random mode.

Table 1 shows the total memory access cost for the code in Figures 1(a) and (b), if we assume that latencies for random-mode and page-mode accesses are uniform, and that accesses are not going through cache. Assuming that random-mode latency is three times the page-mode latency as in [14], loop (a) has a total latency cost of $6 \cdot n \cdot m \cdot PM_{Latency}$, while (b) has a cost of $3 \cdot n \cdot m \cdot PM_{Latency}$, a factor of 2 difference in overall memory latency.
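The cost comparison can be checked with a small simulation of a single open-row buffer. The sketch below is illustrative only: the loop shapes are assumed from the description of Figure 1 (references A[j][i] and B[i], with the outer i loop unrolled by 4), every name in it (`cost`, `original`, `unrolled`, `elems_per_page`) is hypothetical, and latencies are expressed in units of the page-mode latency with random mode assumed to cost three units.

```python
RANDOM_LATENCY = 3  # assumed: random mode costs 3x page mode, as in the text
PAGE_LATENCY = 1    # page-mode latency used as the unit of cost


def cost(trace):
    """Total latency of a trace of row identifiers through one open-row buffer."""
    open_row = None
    total = 0
    for row in trace:
        # An access is in page mode only if it hits the currently open row.
        total += PAGE_LATENCY if row == open_row else RANDOM_LATENCY
        open_row = row
    return total


def original(n, m, elems_per_page):
    """Assumed shape of loop (a): A[j][i] and B[i] alternate in the inner loop."""
    for i in range(n):
        for j in range(m):
            yield ("A", j, i // elems_per_page)
            yield ("B", i // elems_per_page)


def unrolled(n, m, elems_per_page, unroll=4):
    """Assumed shape of loop (b): i unrolled by 4, accesses grouped per array."""
    for i in range(0, n, unroll):
        for j in range(m):
            for k in range(unroll):
                yield ("A", j, (i + k) // elems_per_page)
            for k in range(unroll):
                yield ("B", (i + k) // elems_per_page)


n, m, elems_per_page = 64, 16, 64
cost_a = cost(original(n, m, elems_per_page))
cost_b = cost(unrolled(n, m, elems_per_page))
print(cost_a, cost_b)
```

Under these assumptions every access in the original loop is in random mode, while the transformed loop pays for two random-mode accesses and six page-mode accesses per unrolled iteration, which reproduces the factor of 2.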

This example shows the potential for improving performance in embedded DRAM devices through the previously described code transformations. To expose opportunities for page-mode accesses by applying unroll-and-jam and memory access reordering, a compiler algorithm must: (1) determine the safety of these code transformations and select a loop for which unrolling is profitable; (2) select an unroll factor that increases page-mode accesses while not causing register spilling; and (3) transform the code to reorder the memory accesses. In the next section we present our compiler algorithm for exploiting page-mode accesses, which includes the three steps above.

(b) After unroll-and-jam and reordering
Figure 1: Unroll-and-jam and Reordering

We have developed this algorithm in the context of a compiler for DIVA, a system architecture that incorporates processing-in-memory embedded DRAM devices as smart-memory co-processors in an otherwise conventional system [12]. Although the proposed compiler algorithm is not specific to the requirements of the DIVA architecture, we describe the algorithm from the viewpoint of an architecture that supports superword-level parallelism, with an instruction set akin to multimedia extensions such as Intel's SSE and Motorola's AltiVec. Superword-level parallelism refers to performing the same operation in parallel on multiple fields of a superword, which is an aggregate object larger than a machine word. In the following algorithm description, we will refer to register width to support the notion that a machine might have different register widths for distinct objects. If a machine does not support superword-level operations, then the register width is the same as the machine word.

In previous work, we presented an algorithm for exploiting spatial and temporal locality in superword register files in a compiler that already supports superword-level parallelism [22]. In this paper, we show that with a similar approach we can also exploit spatial locality in the page of a DRAM memory array.

3. ALGORITHM

In this section we introduce a compiler algorithm for exploiting page-mode memory accesses. Our algorithm is applicable to loop nests with array references in the loop body, where the array subscript expressions are affine functions of the loop index variables. Only array accesses are reordered by the algorithm, since it is difficult to determine whether two scalar accesses are on the same memory page. For presentation purposes, we make some simplifying assumptions as follows.

1. Array objects are aligned at memory page boundaries.

2. The lowest dimension sizes of array objects are multiples of a memory page size.

3. The compiler backend does not change the memory access order generated by the algorithm.

Some of these assumptions can be removed by modifying the compiler backend (1, 3) or by padding array objects (2).

    1. Select a loop to unroll
    2. Control register pressure
    3. Align the loop to page boundaries
    4. Unroll-and-jam
    5. Reorder memory accesses

Figure 2: Algorithm

The algorithm presented in this paper unrolls a single loop in a loop nest, since in practice unrolling more than one loop could create register pressure and instruction cache misses. A set of heuristics is used to select which loop to unroll and its unroll amount. These heuristics result in a fast algorithm that is effective for the benchmarks presented in Section 4.

However, unrolling multiple loops in a loop nest might expose more opportunities for page-mode accesses than when unrolling a single loop. In previous work [22] we present an algorithm for exploiting superword-level locality which uses unroll-and-jam to expose data reuse, and unrolls multiple loops in a nest. The computation of the unroll amounts requires a complex analysis to determine the exact number of superword registers needed to keep the data accessed in the loop. This complexity is due to several factors such as group reuse among copies of a reference created by unrolling (which may reuse data in superword registers) and self-spatial reuse of the original references.

A more complex algorithm for exploiting page-mode memory accesses which would consider multiple loops for unrolling is the subject of future work, and we plan to leverage our analysis and algorithm for selecting unroll amounts described in [22].

Figure 2 illustrates the steps of the algorithm, which are described in the remainder of this section. The first step selects which loop to unroll, after determining the safety of the code transformations (unroll-and-jam and statement reordering). The second step selects an unroll factor that increases page-mode accesses while not causing register spilling. The last three steps apply the code transformations to the loop nest.

Selecting a Loop To Unroll. The first step of the algorithm selects a loop to unroll, based on the number of random-mode memory accesses of the loop nest after applying unroll-and-jam. The algorithm uses data dependence information to determine the safety of unroll-and-jam and to prevent selection of unroll amounts greater than the dependence distance if inner loop dependence distances are negative.


For each loop $l$ in the loop nest, the algorithm computes the unroll amount $X_l$ and its corresponding number of random-mode accesses $R_l$, such that $R_l$ is the smallest number of random-mode memory accesses if $l$ is selected to be unrolled (assuming that references to the same memory page can be grouped together). Then the algorithm compares the number of random-mode accesses of each loop in the nest and selects the loop with the smallest $R_l$.

When computing the unroll amount $X_l$ that minimizes $R_l$, the algorithm considers only references that are loop-variant with $l$ in the lowest dimension. For a reference that is loop-variant with $l$ in the lowest dimension, unrolling $l$ and jamming the copies of $l$ in the loop body creates opportunities for page-mode accesses between the copies of the original reference. On the other hand, unrolling loop $l$ does not change the total number of random-mode accesses generated by references that are loop-variant with $l$ in one of the higher dimensions. Loop-independent dependences can be removed by locality optimizations such as scalar replacement [2] or superword replacement [22].

For each loop $l$, the smallest unroll amount that minimizes $R_l$ is computed as in Equation 1.


where $P$ is the memory page size, $A$ is the set of array references in the loop nest which are loop-variant with $l$ in the lowest dimension, $a$ is an array reference in $A$, $T(a)$ is the type size of $a$ and $C(a, l)$ is the coefficient of the index variable $l$ in the lowest-dimension subscript of $a$.

After computing the unroll amounts, the algorithm computes the corresponding number of random-mode memory accesses $R_l$, with the goal of selecting the loop with the smallest $R_l$. For each loop $l$, the number of random-mode accesses $R_l$ is computed as the number of distinct pages in the memory-page footprint of $A$, $P_l(A, X_l)$ (assuming that the algorithm can group together references to the same page). In previous work [22], we present the computation of the superword footprint of a set of array references in a loop nest, which consists of the number of distinct superwords accessed by the references, a function of the unroll amounts. The memory-page footprint can be computed in a similar way to that of the superword footprint. First, the set of references is partitioned into groups of uniformly generated references [25], that is, references to the same array such that, for each array dimension, the array subscripts differ only by a constant term². Then, for each group of references, the algorithm computes the number of pages accessed in the unrolled loop body. Finally, the total number of pages is computed as the sum of those of each group of uniformly generated references.
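The three-step footprint computation can be illustrated by brute-force enumeration. The sketch below is not the closed-form computation of [22]; it assumes one-dimensional references of the form `array[coeff * l + const]`, a page-aligned starting iteration, and a single element size for all arrays. The function name `page_footprint` and the tuple layout of a reference are hypothetical.

```python
from collections import defaultdict


def page_footprint(refs, unroll, page_bytes, type_bytes):
    """Count distinct pages touched by the unrolled body, summed over groups.

    refs: list of (array_name, coeff, const) for a subscript coeff * l + const.
    References with the same array and coefficient differ only by a constant,
    so they fall in the same uniformly generated group.
    """
    groups = defaultdict(set)
    for array, coeff, const in refs:
        for k in range(unroll):  # one copy of the reference per unrolled iteration
            byte_offset = (coeff * k + const) * type_bytes
            groups[(array, coeff)].add(byte_offset // page_bytes)
    # Pages are counted per group, then added up across groups.
    return sum(len(pages) for pages in groups.values())


refs = [("A", 1, 0), ("A", 1, 1), ("B", 1, 0)]
footprint_unrolled = page_footprint(refs, unroll=64, page_bytes=256, type_bytes=4)
footprint_rolled = page_footprint(refs, unroll=1, page_bytes=256, type_bytes=4)
print(footprint_unrolled, footprint_rolled)
```

With 64 four-byte elements per page, the two references to A together spill onto a second page once the loop is unrolled by 64, while B stays on one page, so the unrolled body touches three pages.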

Controlling Register Pressure. After selecting a loop $l$ to unroll, the algorithm adjusts the unroll amount of the selected loop to avoid register pressure and register spilling, which could offset the benefits of unroll-and-jam.

In a previous paper [22] we presented the computation of the number of registers required to keep the data accessed by the references in the loop nest after applying transformations for increasing locality in the superword register file. Here we present a simplification of this algorithm to provide the intuition behind our approach.

² We assume that two or more references that access the same array but are not uniformly generated access distinct data in memory, which results in a conservative estimate of the number of memory pages.


We  compute  an  upper  bound  of  the  total  number  of  registers 
that  can  be  simultaneously  live  by  partitioning  the  references 
in  the  loop  nest  in  groups  of  uniformly  generated  references 
and  computing  the  superword  footprint  of  each  group. 

For example, the number of registers required for a group that contains a single reference $a$ that is variant with $l$ is given by Equation 2, assuming $C(a, l) = 1$.

$N_R(a) =$    (2)


where $W$ is the register width in bytes (for example, $W = 4$ for a 32-bit scalar register, and $W = 16$ for a 128-bit superword register such as the AltiVec's) and $T(a)$ is the type size of $a$ in bytes.

The  superword  footprint  of  a  group  consists  of  the  union  of 
the  footprints  of  the  individual  references,  as  some  of  the 
reference  footprints  may  overlap,  depending  on  the  distance 
between  the  constant  terms  in  the  array  subscripts. 

The total number of registers required (TNR) to keep the data accessed in the loop nest is computed as the sum of the number of registers required for each group of uniformly generated references. If the total number of registers is larger than the number of registers available, the algorithm adjusts the unroll amount $X_l$ by dividing it by the ratio of TNR and the number of available registers NREG.


$$X_l = \frac{X_l}{\left\lceil \frac{TNR}{NREG} \right\rceil} \qquad (3)$$


The number of available registers NREG is given by the number of registers in the register file minus the number of registers reserved by our algorithm for temporary storage.
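As a rough illustration of the adjustment, the sketch below sums a per-group register estimate and divides the unroll amount when the total exceeds NREG. Two parts are assumptions of this sketch rather than statements of the paper: the per-reference estimate (contiguous bytes touched by a unit-stride reference divided by the register width, rounded up) and the rounding of the TNR/NREG ratio up to an integer. All function names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return (a + b - 1) // b


def registers_for_reference(unroll, type_bytes, reg_bytes):
    # Assumed estimate: a unit-stride reference unrolled `unroll` times touches
    # unroll * type_bytes contiguous bytes, packed into registers of reg_bytes.
    return ceil_div(unroll * type_bytes, reg_bytes)


def adjust_unroll(unroll, group_type_bytes, reg_bytes, nreg):
    """Return (adjusted unroll amount, TNR) for one single-reference group per entry."""
    tnr = sum(registers_for_reference(unroll, t, reg_bytes) for t in group_type_bytes)
    if tnr > nreg:
        # Divide the unroll amount by the (rounded-up) ratio of TNR to NREG.
        unroll //= ceil_div(tnr, nreg)
    return unroll, tnr


# Two groups of 4-byte elements, 32-byte (256-bit) registers, 8 registers available.
adjusted, tnr = adjust_unroll(64, [4, 4], 32, 8)
print(adjusted, tnr)
```

In this example the unrolled body would need 16 registers, twice what is available, so the unroll amount is halved from 64 to 32.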

Since the smallest type size is used in Equation 1, all references that have spatial reuse carried by loop $l$ can exploit spatial reuse fully at the memory page level. Therefore, choosing a loop $l$ that has the fewest random-mode accesses when unrolled by $X_l$ is a reasonable choice. Dividing it evenly if too many registers are used, as in Equation 3, will result in a solution that is also aligned to a page boundary at the beginning of the loop. However, choosing a different loop can result in different register requirements. For example, if a loop is selected and then its unroll amount is reduced to half because of register pressure, there can be another loop that results in more random-mode accesses but requires fewer registers, leading to fewer overall random-mode accesses than the initial selection.

Aligning the Loop To Page Boundaries. If the starting addresses of the memory accesses in the unrolled loop body are not aligned to a page boundary, each set of memory accesses to the same array will have one additional random-mode access per iteration. To remove these unnecessary random-mode accesses, the alignment step of the algorithm splits the iteration space of the chosen loop into at most three loops (head, body and tail), so that the starting addresses in the body loop are aligned to page boundaries. The body loop contains all iterations that access memory between the first and the last page boundary, with the head loop performing previous iterations starting from the lower bound of the original loop, and the tail loop computing subsequent iterations up to the upper bound of the original loop.

    for(i=0; i<1280; i+=64){
      Load A[i+1]
      Load A[i+2]
      ...
      Load A[i+64]
      Load B[i+1]
      ...
    }

    (a) Unaligned

    for(i=0; i<63; i++){
      Load A[i+1]
      Load B[i+1]
    }
    for(i=63; i<1279; i+=64){
      Load A[i+1]
      Load A[i+2]
      ...
      Load A[i+64]
      Load B[i+1]
      ...
    }
    for(i=1279; i<1280; i++){
      Load A[i+1]
      Load B[i+1]
    }

    (b) Aligned

Figure 3: Alignment by Iteration Space Splitting

Figure 3(a) shows an example of an unrolled loop with misaligned memory references. Assuming that array A is aligned to a memory page, the memory accesses for one iteration of the unrolled loop span a page boundary. In (b), the iteration space of the original loop is split so that the memory accesses in the body loop start and end at page boundaries. The lower bound of the body loop and the lower bound of the tail loop are computed from the array subscript expressions and the loop bounds as follows. The earliest iteration where the most array references are aligned on a page boundary is used as the lower bound of the body loop. Let $a$ be a representative reference to be aligned, $l$ the loop index variable for the selected loop, and $lb$ and $ub$ the lower and upper bounds for $l$. To derive the loop bounds for the copies of the selected loop resulting from iteration space splitting, we begin with the starting address, addr, of the references when $l = lb$, where addr = aligned + offset. Here, aligned refers to the largest multiple of the page size less than addr and offset is the offset of addr within a page.


Assuming the stride of $a$ is 1, the lower bounds of the body loop (split1) and the tail loop (split2) are computed by the following equations, where $P$ is the memory page size and $T(a)$ is the type size of $a$.

$$split1 = lb + \frac{P}{T(a)} - \frac{offset \bmod P}{T(a)}$$

$$split2 = ub - \left( ub \bmod \frac{P}{T(a)} \right)$$

The head loop is not needed if the reference is aligned, as is the case when offset mod $P$ = 0. If $lb$ is constant, split1 and split2 can be computed at compile time. Otherwise, they are computed at run time.


If the selected reference has non-unit stride, the solution is much more complex. In this case, we build a modular linear equation and choose the smallest solution [5].
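For the unit-stride case, the split points can be sketched as below and checked against the loop bounds shown in Figure 3. The function name and parameters are hypothetical. The tail bound is computed here so that the body loop covers a whole number of pages starting at split1; that choice is an assumption of this sketch, made because it reproduces the bounds in Figure 3, and it should not be read as the paper's exact formula.

```python
def split_bounds(lb, ub, offset_bytes, page_bytes, type_bytes):
    """Return (split1, split2): lower bounds of the body loop and the tail loop.

    lb, ub: bounds of the original loop (ub exclusive, as in `i < ub`).
    offset_bytes: offset within a page of the reference's address when l == lb.
    """
    elems_per_page = page_bytes // type_bytes
    misalign = (offset_bytes % page_bytes) // type_bytes
    if misalign == 0:
        split1 = lb  # reference already aligned: no head loop is needed
    else:
        split1 = lb + elems_per_page - misalign
    split1 = min(split1, ub)
    # Assumed: the body runs in whole-page steps from split1; the rest is the tail.
    split2 = ub - (ub - split1) % elems_per_page
    return split1, split2


# Figure 3: reference A[i+1], 4-byte elements, 256-byte pages, loop 0 <= i < 1280.
figure3 = split_bounds(lb=0, ub=1280, offset_bytes=4, page_bytes=256, type_bytes=4)
aligned = split_bounds(lb=0, ub=1280, offset_bytes=0, page_bytes=256, type_bytes=4)
print(figure3, aligned)
```

For the Figure 3 loop this yields a head loop over [0, 63), a body loop over [63, 1279) and a one-iteration tail, while an already aligned reference needs neither head nor tail.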

Reordering Memory Accesses. Finally, the reordering
step hoists loads to the top of the loop body and sinks stores
to the bottom. While being hoisted or sunk, the loads and stores
to the same array are grouped together and sorted by their
offset addresses. When there are unaligned array references
even after aligning the loop, sorting the offset addresses can
reduce the number of random-mode accesses. Figure 4 shows
an example where the page size includes 64 elements of array
A. All eight memory accesses are in random mode before
sorting. After sorting the offset addresses, only two
random-mode accesses remain.

(a) Unsorted

    for(i=32; i<N; i+=64){
      load A[i + 0]   (RMA)
      load A[i + 32]  (RMA)
      load A[i + 8]   (RMA)
      load A[i + 40]  (RMA)
      load A[i + 16]  (RMA)
      load A[i + 48]  (RMA)
      load A[i + 24]  (RMA)
      load A[i + 56]  (RMA)
    }

(b) Sorted

    for(i=32; i<N; i+=64){
      load A[i + 0]   (RMA)
      load A[i + 8]
      load A[i + 16]
      load A[i + 24]
      load A[i + 32]  (RMA)
      load A[i + 40]
      load A[i + 48]
      load A[i + 56]
    }

Figure 4: Sorting Offset Addresses


Parameters             Value   Unit
Random-mode latency    12      Cycles
Page-mode latency      4       Cycles
Page size              256     Bytes

Table 2: Simulation Parameters
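The effect of sorting in the Figure 4 example can be checked with a short C sketch. The function names below are hypothetical, and the sketch assumes that array A starts on a page boundary, that only one page is open at a time, and that no page is open when the iteration begins.

```c
#include <stdlib.h>

/* Comparison callback for qsort: ascending integer order. */
static int cmp_int(const void *x, const void *y)
{
    int a = *(const int *)x, b = *(const int *)y;
    return (a > b) - (a < b);
}

/* Count random-mode accesses for one loop iteration. An access is in
 * random mode whenever it touches a page other than the open one.
 * Assumption: no page is open at the start (open_page = -1). */
int count_random_mode(int i, const int *offsets, int n, int elems_per_page)
{
    int open_page = -1, rma = 0;
    for (int k = 0; k < n; k++) {
        int page = (i + offsets[k]) / elems_per_page;
        if (page != open_page) {
            rma++;
            open_page = page;
        }
    }
    return rma;
}

/* Sort the offsets of the references to one array in place. */
void sort_offsets(int *offsets, int n)
{
    qsort(offsets, (size_t)n, sizeof offsets[0], cmp_int);
}
```

With the offsets of Figure 4(a), i = 32 and 64 elements per page, the unsorted order alternates between two pages on every access, while the sorted order switches pages only once after the first access.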


(a) Before

    for(...){
      load A   (RMA)
      load B   (RMA)
      Computation
      store A  (RMA)
      store B  (RMA)
    }

(b) After

    for(...){
      load A
      load B   (RMA)
      Computation
      store B
      store A  (RMA)
    }

Figure 5: Grouping loads and stores


This step also groups loads and stores to the same array
when possible, to exploit page mode among them. There can
be page-mode accesses between loads and stores if the last
load and the first store access the same page, and there are
no intervening memory accesses between them. The same is
true between the last store of an iteration of the innermost
loop and the first load of the next iteration. Using this
technique, at most 2 random-mode accesses per iteration can be
eliminated. Figure 5(a) shows an example where two array
objects are read and written. Assuming all loads and stores
to the same array objects access the same memory page, the
loop in (a) results in four random-mode accesses whereas (b)
has only two random-mode accesses per iteration.
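The per-iteration counts for Figure 5 can be reproduced with a small C sketch. The function name is hypothetical, and the sketch assumes, as the text does, that every access to a given array falls in the same page and that the access sequence repeats unchanged in every iteration, so the page left open by the previous iteration is the one touched by the last access of the sequence.

```c
/* Random-mode accesses per iteration in steady state. `pages` holds the
 * page id touched by each memory access of one loop iteration, in order. */
int steady_state_random_mode(const int *pages, int n)
{
    int rma = 0;
    int open_page = pages[n - 1];  /* page left open by the previous iteration */
    for (int k = 0; k < n; k++) {
        if (pages[k] != open_page) {
            rma++;
            open_page = pages[k];
        }
    }
    return rma;
}
```

Encoding the page of A as 0 and the page of B as 1, the order load A, load B, store A, store B gives four random-mode accesses per iteration, and the order load A, load B, store B, store A gives two.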

4. EXPERIMENTS

Although our algorithm is applicable to general embedded-DRAM
systems, for the experiments presented in this paper
we used a compiler framework that we have built for DIVA,
as previously described. The DIVA PIM device has a 256-bit
datapath for executing superword operations in parallel.
In addition to a conventional scalar register file, the DIVA
PIM processor has 32 256-bit registers (each of which can be
treated as eight 32-bit operands, sixteen 16-bit operands or
32 8-bit operands). Thus, for data allocated to superword
registers, width W from Section 3 is 256, and for scalar
registers W is 32. As the DIVA PIM devices contain no data
cache, exploiting spatial locality in the memory pages can
have significant impact on application performance.

Name   Description                      Input Size
VMM    Vector-matrix multiply           64 elements
MMM    Matrix-matrix multiply           64 elements
YUV    RGB to YUV conversion            32K elements
FIR    Finite impulse response filter   256 filter, 1K

Table 3: Benchmark programs


Figure 6: Experimental Flow

A prototype of the DIVA PIM chip has been fabricated
recently [6], but the complete DIVA system is not available for
our experiments at the time of this writing. Therefore, we
used a cycle-accurate DIVA simulator (DSIM) [6], which is
modified from RSIM [20]. Table 2 shows the simulation
parameters for the memory system, which closely match those
of the IBM Cu-11 embedded DRAM macro [14]. In general
there can be multiple DRAM macros and multiple open
pages in a single chip, but for our experiments we assume
that only one memory page is open at any given time.
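To make the two latencies of Table 2 concrete, the following C sketch estimates the DRAM access cycles of an access sequence. The additive model and the function name are assumptions for illustration only: it ignores any overlap between memory accesses and computation, which the cycle-accurate simulator does model.

```c
/* Latencies from Table 2, in cycles. */
enum { RANDOM_MODE_LATENCY = 12, PAGE_MODE_LATENCY = 4 };

/* Hypothetical additive model: total DRAM access cycles for a sequence
 * with `random_mode` random-mode and `page_mode` page-mode accesses. */
long access_cycles(long random_mode, long page_mode)
{
    return random_mode * RANDOM_MODE_LATENCY + page_mode * PAGE_MODE_LATENCY;
}
```

Under this model, the eight random-mode accesses of the unsorted loop in Figure 4 cost 96 cycles per iteration, while the sorted loop, with two random-mode and six page-mode accesses, costs 48.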

We implemented the bulk of the algorithm presented in the
previous section, and integrated it into the Stanford SUIF
compiler. In our current implementation, we have not
implemented alignment to page boundaries or combining loads
and stores for page-mode accesses. However, these steps of
the algorithm do not affect the results for the four
benchmarks used. The input to the modified SUIF compiler is a
C program, and the output is an AltiVec-extended C
program [18] which, in turn, is translated by a preliminary
version of the DIVA gcc backend.

Table 3 shows the four kernels used to evaluate the
effectiveness of the algorithm. The kernels represent
data-intensive applications in scientific and multimedia domains.
Figure 6 shows the experimental flow. The main algorithm
involves selecting unroll factors, performing unroll-and-jam
and memory access reordering, and is represented by the
hashed rectangles in Figure 6.

As previously stated, this algorithm is implemented as part
of a compiler that exploits superword-level parallelism and
locality in the superword register file. Thus, the
experimental methodology also includes optimizations to exploit
superword-level parallelism (SLP). Further, we exploit
spatial and temporal locality in the superword register file through
a combination of unroll-and-jam and superword replacement
(SWR). Superword replacement is applied after
unroll-and-jam to replace unnecessary superword memory accesses with
references to superword temporaries that will then be
allocated to superword registers by a backend compiler [22]. In
our previous work, we selected unroll factors for
unroll-and-jam that maximize reuse in superword registers; here, we use
the unroll factors determined by the algorithm in Section 3,
which are likely to be larger than in our previous work. In
some sense, the optimizations for page-mode memory
accesses are complementary to exploiting SLP and locality in
superword registers, and the page-mode optimizations are
difficult to isolate in our compiler. In fact, because the SLP
and SWR optimizations reduce the number of memory
accesses, we will see less benefit from the page-mode
optimizations than if considered in isolation.

We use as our baseline the SLP version of the code with
no unrolling beyond what is required to exploit
parallelization of the innermost loop. The UNROLL version includes
unroll-and-jam, where the loop selected by the algorithm in
Section 3 is unrolled by the chosen amount, and inner loop
bodies are fused together. As compared to the baseline
version, this version isolates the benefits of unroll-and-jam and
superword replacement in terms of reduced memory accesses
and less loop overhead.
The PMA version reflects the performance improvements
due to memory access reordering, yielding the full benefit
of the optimizations for page-mode accesses.

In these experiments, we used optimization level -O1 for
the DIVA gcc backend rather than a higher level of
optimization. This was required to avoid reordering of memory
accesses in subsequent optimization passes, which occurs at
higher levels of optimization. Since reordering commonly
occurs in backend optimizations, we discuss the implications of
combining the page-mode optimizations with other backend
compiler techniques in the next section.

For all programs but YUV, the algorithm was able to unroll
the selected loop by the unroll factor determined by
Equation 1. For YUV, which references six distinct arrays, this
unroll factor was too large and resulted in register spilling.
The algorithm reduced the unroll amount by half and the
register spilling was eliminated.

We first consider how the optimizations for exploiting
page-mode memory accesses impact memory stall time.
Figure 8 shows the normalized execution times broken down
into processor busy time and memory stall time, derived
from simulation. The UNROLL version sees a significant
reduction in both processor busy time (9% to 60%) and
memory stall time (25% to 71%). The primary reason for this
is that superword replacement has eliminated a large
number of memory accesses, which not only reduces memory
stall time, but also reduces processor busy time by
eliminating address calculation and instruction issue associated with
the eliminated memory accesses. Further, reduction in loop
control overhead also reduces processor busy time. For all
programs, the PMA version further reduces memory stall
time by 21% to 33%. As compared to the UNROLL version,
we have not eliminated any instructions, but rather have
converted random-mode accesses to page-mode accesses.

Figure 7: SLP versions of VMM and MMM

Figure 8: Normalized Execution Time

Figure 9: Percentage of Page-Mode Accesses

Next we consider in Figure 9 the percentage of all memory
accesses that are in page mode, with the remainder in
random mode. The percentage of page-mode accesses ranges
from 25% to 37% for the baseline version of the programs.
We see a decrease in page-mode accesses as a percentage of
memory accesses for most programs for the UNROLL
version, ranging from 6% to 32%. This effect is because
superword replacement has removed a large number of memory
accesses, and the remainder tend to be in random mode. For
example, in the VMM loop shown in Figure 7(a) after SLP,
references to C[i][j] in the k-loop are loop-invariant after
unrolling, and are usually removed, but were page-mode
accesses in the SLP version due to the preceding store to the
same location. In MMM, the page-mode percentage actually
increases for the UNROLL version, as can be seen in
Figure 7(b). References to A[i][k] are random-mode accesses,
and are eliminated by superword replacement. For the PMA
version, which reflects the same number of memory accesses
as the UNROLL version, the percentages of page-mode
accesses range from 63% to 87%.

These results show that our algorithm has been successful
at increasing the percentage of page-mode accesses and
reducing the memory stall time. We now see how the
approach impacts the overall performance. Figure 10 shows
the speedups for the SLP, UNROLL and PMA versions of
Figure 6. Overall speedups as compared to the SLP
baseline range from 1.25 to 2.19. Most of this speedup comes
from the 1.19 to 1.89 improvement from unroll-and-jam and
superword replacement, as can be seen from the UNROLL
version. The speedup of the PMA version over the UNROLL
version ranges from 1.04 to 1.16.

5. IMPLEMENTATION ISSUES

In this section, we consider in general terms how to
incorporate this algorithm into current and future compilers.
First, the compiler backend optimizations must be aware
that page-mode optimizations are being performed.
Otherwise, instruction reordering optimizations to increase
instruction-level parallelism may undo the effect of the page-mode
optimizations. A simple solution is to keep the relative order of
memory operations intact when performing instruction
reordering. There is an interesting tradeoff space that must
be considered, since page-mode optimizations which favor
memory accesses to the same page may potentially lengthen
the critical path to performing computations, where
multiple operands from different pages may be required. This
potential problem is mitigated if there are a large number
of memory units that can operate in parallel, or if there are
multiple memory pages from which data can be accessed
rather than the single page used in our experiments.

Figure 10: Speedup Breakdown

A second issue is how to combine this approach with cache
optimizations for devices that have on-chip data caches. In
cache-based architectures, the page-mode optimizations are
still applicable as long as the unrolled footprint for an object
exceeds the cache line size. In such a case, the spatial locality
within the memory page complements spatial locality within
a cache line.

6. RELATED WORK

Previous research has identified the benefits of exploiting
page-mode DRAM accesses [19, 17, 3, 21, 11]. Moyer
modeled memory systems analytically and developed a compiler
technique called access ordering that reorders memory
accesses to better utilize the memory system [19]. McKee et
al. described a Stream Memory Controller (SMC) whose
access ordering circuitry attempts to maximize memory system
performance based on the device characteristics [17]. Their
compiler is used to detect streams but access ordering and
issue is determined by the hardware. Chame et al.
manually optimized an application for a PIM-based
(embedded-DRAM) system [3] by applying loop unrolling and memory
access reordering to increase the number of page-mode
accesses.

Panda et al. have developed a series of techniques to exploit
page-mode DRAM access in high-level synthesis [21]. Their
techniques include scalar variable clustering, memory access
reordering, hoisting and loop transformations. While their
ASIC design was able to exploit page-mode memory access,
they do not describe an algorithm for automatic code
generation. Grun et al. have optimized a set of benchmarks
to better utilize efficient memory access modes for their IP
library based Design Space Exploration [11]. However, their
focus was on accurate timing models of the hardware system
description.

This paper is distinguished from previous research as the
design and implementation of a compiler algorithm to
exploit page-mode automatically. Although the experiments
are performed for a PIM-based system [12], this compiler
framework is applicable to embedded-DRAM systems and
can also be used as a preprocessor for high-level synthesis.

7. CONCLUSION

This paper presented a compiler algorithm for exploiting
page-mode memory access in embedded-DRAM systems. Our
compiler algorithm has been implemented in the Stanford
SUIF compiler infrastructure and evaluated for four scientific
and multimedia kernels. The speedups achieved by
exploiting page-mode memory access alone range from 1.04 to 1.16
for the four kernels, resulting in overall speedups
ranging from 1.25 to 2.19 when combined with optimizations
targeting superword-level parallelism and locality. These
results show that there is a distinct benefit in exploiting
page-mode memory access in embedded systems, where the
DRAM access time dominates the memory latency seen by
the processor. Furthermore, our results show that for
embedded systems with support for superword-level parallelism [23,
1, 12], optimizations for exploiting the DRAM's page-mode
accesses are complementary to optimizations for
superword-level parallelism and superword-level locality.

Abstract

In this paper, we describe an algorithm and implementation of locality optimizations for
architectures with instruction sets such as Intel's SSE and Motorola's AltiVec that support operations
on superwords, i.e., aggregate objects consisting of several machine words. We treat the large
superword register file as a compiler-controlled cache, thus avoiding unnecessary memory accesses
by exploiting reuse in superword registers. This research is distinguished from previous work on
exploiting reuse in scalar registers because it considers not only temporal but also spatial reuse. As
compared to optimizations to exploit reuse in cache, the compiler must also manage replacement,
and thus, explicitly name registers in the generated code. We describe an implementation of our
approach integrated with a compiler that exploits superword-level parallelism (SLP). We present a
set of results derived automatically on 4 multimedia kernels and 2 scientific benchmarks. Our
results show speedups ranging from 1.3 to 3.1X on the 6 programs as compared to using SLP alone,
and we eliminate the majority of memory accesses.


1. Introduction

In response to the increasing importance of multimedia applications in embedded and
general-purpose computing environments, many microprocessors now incorporate an expanded instruction
set and architectural extensions specifically targeting multimedia requirements. The core
component of such architectural extensions is a functional unit that can operate on aggregate objects,
performing bit-level operations, or SIMD parallel operations on variable-sized fields in the object
(e.g., 8, 16, 32 or 64-bit fields). If the aggregate objects are larger than the size of a machine word,
then they are called superwords [1]. Examples include Motorola's AltiVec and Intel's SSE, a
descendant of MMX. If the same size as the machine word, then individual fields are referred to as
subwords [2]. A related class of architectures employ processing-in-memory (PIM) technology to
exploit the high memory bandwidth when processing logic is combined on chip with large amounts
of DRAM; several PIM-based architectures rely on superword parallelism to make more effective
use of available memory bandwidth [3, 4, 5, 6].

While multimedia extension and related architectures have been available for some time,
convenient methodologies for developing application code that targets these extensions are in their
infancy. There is recent compiler research for such architectures to automatically exploit
superword-level parallelism, performing computations or memory accesses in parallel in a single instruction
issue [1, 7, 8, 9, 10].

In this paper, we recognize an additional optimization opportunity not addressed by this previous
work. An important feature of all such architectures is a register file of superwords (e.g., each 128
bits wide in an AltiVec), usually in addition to the scalar register file. A set of 32 such superword
registers represents a not insignificant amount of storage close to the processor. Accessing data
from superword registers, versus a cache or main memory, has two advantages. The most obvious
advantage is lower latency of accesses; even a hit in the L1 cache has at least a 1-cycle latency.
Accesses to other caches in the hierarchy or to main memory carry much higher latencies. Another
advantage is the elimination of memory access instructions, thus reducing the number of instructions
to be issued.

In this paper, we treat the superword register file as a small compiler-controlled cache. We
develop an algorithm and a set of optimizations to exploit reuse of data in superword registers
to eliminate unnecessary memory accesses, which we call superword-level locality. We evaluate
the effectiveness of these superword-level locality (SLL) optimizations through an implementation
integrated with the algorithm for exploiting superword-level parallelism (SLP) presented in [1].

Our approach is distinguished from previous work on increasing reuse in cache [11, 12, 13, 14,
15, 16, 17, 18], in that the compiler must also manage replacement, and thus, explicitly name the
registers in the code. As compared to previous work on exploiting reuse in scalar registers [18,
19, 20], the compiler considers not just temporal reuse, but also spatial reuse, for both individual
statements and groups of references. Further, it also considers superword parallelism in making
its optimization decisions. Exploiting spatial and group reuse in superword registers requires more
complex analysis as compared to exploiting temporal reuse in scalar registers, to determine which
accesses map into the same superword.

In conjunction with exploiting SLP, the algorithm performs what we call superword
replacement, to replace accesses to contiguous array data with superword temporaries and exploit reuse by
replacing accesses to the same superword with the same temporary. Following this code
transformation, a separate compilation pass will be able to allocate superword registers corresponding to
the superword temporaries. To enhance the effectiveness of superword replacement, it is combined
with a loop transformation called unroll-and-jam, whereby outer loops in a loop nest are unrolled,
and the resulting duplicate inner loop bodies are fused together. Unroll-and-jam reduces the
distance between reuse of the same superword, when reuse is carried by an outer loop, and brings
opportunities for superword replacement into the innermost loop body of the transformed loop nest.
The optimization algorithm derives appropriate unroll factors for each loop in the nest that attempt
to maximize reuse while not exceeding the number of available registers.

The contributions of this paper are as follows:

•  An algorithm for exposing opportunities for compiler-controlled caching of data in superword
register files using unroll-and-jam. The two main components of this algorithm are a model of
the number of memory accesses and registers required associated with a set of unroll factors,
and a strategy for navigating the search space of possible unroll factors.

•  A description of a set of code transformations, which in aggregate we call superword
replacement, for exploiting superword register reuse.

•  Experimental results, derived automatically, comparing performance of six benchmarks and multimedia
kernels optimized for parallelism only, SLP, and optimized for both parallelism and
superword-level locality. Our results show speedups ranging from 1.3 to 3.1X as compared to using SLP
alone, and we eliminate the majority of memory accesses.




This paper extends an earlier description of this work in several ways [21]. We have extended
the algorithm and register requirements analysis to exploit group-temporal reuse across iterations
of the transformed loop nest. We have also greatly expanded the description of code generation.
In the experimental results description, we have improved the results and provided a more detailed
breakdown of the contributions of the different techniques.

The remainder of the paper is organized into 8 sections. Section 2 motivates the problem and
introduces terminology used in the remainder of the paper. Section 3 presents an overview of
the superword-level locality algorithm. Section 4 describes how the algorithm computes the total
number of registers required for exploiting reuse and the resulting number of memory accesses.
Section 5 describes aspects of how the search space is navigated. Section 6 presents optimizations
to actually achieve this reuse of data in superword registers. Section 7 presents experimental results
derived automatically by an implementation in the Stanford SUIF compiler. Section 8 discusses
related work and Section 9 presents conclusions and future work.

2. Background and Motivation

In many cases superword-level parallelism and superword-level locality are complementary
optimization goals, since achieving SLP requires each operand to be a set of words packed into a
superword, which happens, with no extra cost, when an array reference with spatial reuse is loaded
from memory into a superword register. Therefore, in many cases the loop that carries the most
superword-level parallelism also carries the most spatial reuse, and benefits from SLL
optimizations. In this paper, we achieve SLL and SLP somewhat independently, by integrating a set of SLL
optimizations into an existing SLP compiler [1]. The remainder of this section motivates the SLL
optimizations.

Achieving locality in superword registers differs from locality optimization for scalar registers.
To exploit temporal reuse of data in scalar registers, compilers use scalar replacement to replace
array references by accesses to temporary scalar variables, so that a separate backend register
allocator will exploit reuse in registers [19]. In addition, unroll-and-jam is used to shorten the distances
between reuse of the same array location by unrolling outer loops that carry reuse and fusing the
resulting inner loops together [19].

In contrast, a compiler can optimize for superword-level locality in superword registers through
a combination of unroll-and-jam and superword replacement. These techniques not only exploit
temporal reuse of data, but also spatial reuse of nearby elements in the same superword. In fact, even
partial reuse of superwords can be exploited by merging the contents of two registers containing
superwords that are consecutive in memory (see Section 6.4). Thus, as is common in multimedia
applications [22], streaming computations with little or no temporal reuse can still benefit from
spatial locality at the superword-register level, in addition to the cache level.

While cache optimizations are beyond the scope of this paper, we observe that the SLL
optimizations presented here can be applied to code that has been optimized for caches using well-known
optimizations such as unimodular transformations, loop tiling and data prefetching. When
combining loop tiling for caches, superword-level parallelism and superword-level locality optimizations,
the tile sizes should be large enough for superword-level parallelism, and for unroll-and-jam and
superword replacement to be profitable.

These points are illustrated by way of a code example, with the original code shown in
Figure 1(a). This example shows three optimization paths. Figure 1(d) optimizes the code to achieve
superword-level parallelism. Here, sws, an abbreviation for superword size, is the number of data
elements that fit within a superword. For example, if a and b are 32-bit float variables, on a machine
with 128-bit superwords, sws = 4. In Figures 1(b) and (c), we show how the original program
can instead be optimized to exploit reuse in scalar registers, using unroll-and-jam and scalar
replacement, respectively. In Figures 1(e) and (f), we combine these ideas, using unroll-and-jam and
superword replacement, respectively, to transform the code in (d) for both superword-level
parallelism and superword-level locality.

(a) Original loop nest.

    for (i=0; i<n; i++)
      for (j=0; j<n; j++)
        a[i][j] = a[i-1][j] * b[i] + b[i+1];

(b) Unroll-and-jam on the example in (a) (i loop).

    for (i=0; i<n; i+=2)
      for (j=0; j<n; j++) {
        a[i][j] = a[i-1][j] * b[i] + b[i+1];
        a[i+1][j] = a[i][j] * b[i+1] + b[i+2];
      }

(c) After scalar replacement on the code in (b).

    tmp1 = b[0];
    for (i=0; i<n; i+=2) {
      tmp2 = b[i+1];
      tmp3 = b[i+2];
      for (j=0; j<n; j++) {
        tmp4 = a[i-1][j] * tmp1 + tmp2;
        a[i+1][j] = tmp4 * tmp2 + tmp3;
        a[i][j] = tmp4;
      }
      tmp1 = tmp3;
    }

(d) After superword-level parallelism (j loop).

    for (i=0; i<n; i++)
      for (j=0; j<n; j+=sws)
        a[i][j:j+sws-1] = a[i-1][j:j+sws-1] * b[i] + b[i+1];

(e) Unroll-and-jam on the example in (d) (i loop).

    for (i=0; i<n; i+=2)
      for (j=0; j<n; j+=sws) {
        a[i][j:j+sws-1] = a[i-1][j:j+sws-1] * b[i] + b[i+1];
        a[i+1][j:j+sws-1] = a[i][j:j+sws-1] * b[i+1] + b[i+2];
      }

(f) After superword replacement on code in (e).

    tmp1[0:sws-1] = b[0:sws-1];
    stmp1 = tmp1[0];
    stmp2 = tmp1[1];
    field = 2;
    for (i=0; i<n; i+=2) {
      // 'field' denotes an index into 'tmp1' for stmp3
      if (field == 0)
        tmp1[0:sws-1] = b[i+2:i+sws+1];
      stmp3 = tmp1[field];
      for (j=0; j<n; j+=sws) {
        tmp2[0:sws-1] = a[i-1][j:j+sws-1] * stmp1 + stmp2;
        a[i+1][j:j+sws-1] = tmp2[0:sws-1] * stmp2 + stmp3;
        a[i][j:j+sws-1] = tmp2[0:sws-1];
      }
      stmp1 = stmp3;
      stmp2 = tmp1[field+1];
      field = (field+2) % sws;
    }

Figure 1: Example code.

Table 1 shows how the three different optimization paths affect the number of array accesses to
memory in the final code. The original code has n^2 reads and writes to array a and 2n^2 reads to array
b. Exploiting superword-level parallelism in loop j, as in Figure 1(d), reduces the number of reads
and writes to array a by a factor of sws since each load or store operates on sws contiguous data
items; for array b there is no change since the array is indexed by i rather than j. If instead the code
was optimized for scalar register reuse, as in Figure 1(c), we can reduce the number of array reads
of a down by a factor of 2, and reads of b by a factor of n, with the number of writes remaining the
same. By combining superword-level parallelism and superword-level locality as in Figure 1(f), we
see that the number of reads and writes is further reduced by a factor of sws. Figure 1(f) illustrates
some of the challenges in exploiting reuse in superwords. Analysis must identify not just temporal,
but also spatial reuse, and for both individual statements and groups of references. The compiler
also must generate the appropriate code to exploit this reuse; for example, we select scalar fields of
b from the superword, since we are not parallelizing the i loop.

Original       Scalar register reuse only    SLP only       SLP and SLL
Figure 1(a)    Figure 1(c)                   Figure 1(d)    Figure 1(f)

Table 1: Number of array accesses under different optimization paths.

The remainder of this paper describes how the compiler automatically generates code such as is
shown in Figure 1(f), and the performance improvements that can be obtained with this approach.

3. Overview of Superword-Level Locality Algorithm

The superword-level locality algorithm has three main steps, as summarized below. Each step will
be described in more detail in the three subsequent sections.

Step 1: Identifying Reuse. The first step of the algorithm is to identify both array references
and loops carrying reuse. The array references carrying reuse are the ones for which superword
replacement may be applicable. The loops carrying reuse are the ones to which the algorithm will
consider applying unroll-and-jam.

Reuse between two distinct array references in an n-dimensional loop nest is determined from
data dependences, in the form of dependence vectors, $d = (d_1, d_2, \ldots, d_n)$ [23]. A dependence
vector captures the vector distance, in terms of the loop iteration space, such that the two references
may map to the same memory location. Each vector element may be either a constant integer,
+ (a positive direction where the distance is not fixed), - (a negative direction), or * (the direction
and distance are unknown). We refer to a dependence vector as being lexicographically positive if
the first non-zero element is + or a positive integer.
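The lexicographic test described above can be sketched as follows. This is a minimal illustration, not the compiler's code: the function name and the encoding of direction symbols as the strings '+', '-' and '*' are assumptions made here.

```python
def is_lexicographically_positive(vec):
    """Return True if the first non-zero element is '+' or a positive integer.

    vec is a sequence whose elements are integers (fixed distances) or one of
    the strings '+', '-', '*' (hypothetical encoding of direction symbols).
    """
    for d in vec:
        if d == 0:
            continue  # leading zeros do not decide the sign
        # The first non-zero element decides the outcome.
        return d == '+' or (isinstance(d, int) and d > 0)
    return False  # an all-zero vector has no non-zero element


print(is_lexicographically_positive((0, 1, -3)))   # first non-zero is 1
print(is_lexicographically_positive((0, '-', 1)))  # first non-zero is '-'
```

An unknown direction ('*') in the leading non-zero position is treated as not positive, which is the conservative choice for a filter that keeps only dependences known to be positive.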

For the purposes of reuse, the relevant dependences carrying reuse are a subset, and are
characterized as follows:

1. We consider only true dependences (writes followed by reads), input dependences (reads
followed by reads), and output dependences (writes followed by writes). Although output
dependences do not capture reuse of the same data value, they suggest an opportunity for
eliminating unnecessary writes back to memory. Anti-dependences (reads followed by writes) are
not considered.

2.  We  consider  only  lexicographically  positive  dependences. 

3. A dependence vector must be consistent, i.e., the dependence distance in the iteration space
must be constant, or it must be invariant with respect to one of the loops in the nest.

for (i=0; i<N; i+=4) {
  vec1[0:3] = A[i:i+3];
  vec2[0:3] = A[i+8:i+11];

}

(a) Original

tmp[0:3] = A[i:i+3];
vec2[0:3] = A[i+4:i+7];
for (i=0; i<N; i+=4) {
  vec1[0:3] = tmp[0:3];
  tmp[0:3] = vec2[0:3];
  vec2[0:3] = A[i+8:i+11];

}

(b) After exploiting reuse

Figure 2: Reuse Across Iterations


Applying unroll-and-jam to a loop i with a consistent dependence varying with respect to loop
i can create loop-independent dependences in the innermost loop of the unrolled loop body. In the
example in Figure 1(a), there is a true dependence between references A[i][j] and A[i-1][j] with
distance vector (1, 0). After unroll-and-jam, a loop-independent dependence is created between
A[i][j] in the first statement and A[i][j] in the second statement of the loop body, creating a reuse
opportunity.

In addition to reuse between copies of a reference created by unrolling, there can be reuse across
loop iterations. References with consistent dependences carried by a loop have group reuse which
can be exploited by using extra registers to hold the data across iterations. As in previous work [19],
our algorithm exploits reuse across iterations of the innermost loop only, because exploiting reuse
carried by an outer loop could potentially require too many registers to hold the data between uses.
Figure 2 shows how reuse can be exploited across iterations of the innermost loop by using one
register to keep the data that is reused on every two iterations.
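The effect of the transformation in Figure 2 can be checked with a small simulation. The sketch below is illustrative only: it models superword loads as four-element list slices, starts the prologue at index 0, and uses hypothetical function names. It counts loads in both versions and confirms that each iteration sees the same operands.

```python
SWS = 4  # elements per superword, as in the Figure 2 example


def original(A, n):
    """Figure 2(a): two superword loads in every iteration."""
    loads, operands = 0, []
    for i in range(0, n, SWS):
        vec1 = A[i:i + SWS]
        vec2 = A[i + 8:i + 8 + SWS]
        loads += 2
        operands.append((vec1, vec2))
    return operands, loads


def with_reuse(A, n):
    """Figure 2(b): one extra register (tmp) carries data across iterations."""
    tmp = A[0:SWS]          # prologue load
    vec2 = A[SWS:2 * SWS]   # prologue load
    loads, operands = 2, []
    for i in range(0, n, SWS):
        vec1 = tmp                   # loaded two iterations earlier
        tmp = vec2                   # rotate the registers
        vec2 = A[i + 8:i + 8 + SWS]  # the only load left in the loop
        loads += 1
        operands.append((vec1, vec2))
    return operands, loads


A = list(range(24))
ops_a, loads_a = original(A, 16)
ops_b, loads_b = with_reuse(A, 16)
print(ops_a == ops_b, loads_a, loads_b)
```

For longer loops the load count approaches half that of the original version, at the cost of one additional superword register.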

For  loop-invariant  references,  unroll-and-jam  generates  loop-independent  dependences  between 
the  copies  of  the  reference  in  the  unrolled  loop  body,  since  the  same  location  is  being  referenced  by 
each  copy. 

Step 2: Determining unroll factors for candidate loops. The algorithm next determines the
unroll factors for each candidate loop that carries reuse, as previously described, and for which
unroll-and-jam is legal. The optimization goal is as follows.

Optimization Goal: Find unroll factors $(X_1, X_2, \ldots, X_n)$ for loops 1 to n in an n-deep
loop nest such that the number of memory accesses is minimized, subject to the
constraint that the number of superword registers required does not exceed what is
available.


The algorithm determines the unroll factors $(X_1, X_2, \ldots, X_n)$ by searching for the combination
of unroll factors that satisfies the above optimization goal. To guide the search, the algorithm
calculates the total number of registers required for exploiting reuse, which is the sum of the number
of superwords accessed by the references in the loop body after unroll-and-jam is applied, plus the
number of registers needed for holding data across iterations of the innermost loop. Section 4
describes how the algorithm computes the total number of registers required for exploiting reuse
and the resulting number of memory accesses. Section 5 describes aspects of how the search space
is navigated.

Step 3: Code Transformations - Unroll-and-Jam, Superword Replacement, and Related
Optimizations. Once the unroll factors are decided, unroll-and-jam is applied to the loop nest. Array
references are replaced with accesses to superword temporaries. As part of code generation, our
compiler performs related optimizations to reduce the number of additional memory accesses and
register requirements introduced by the SLP passes. These code transformations are the topic of
Section 6.

4. Computing Registers Required and Memory Accesses

This section presents the computation of the number of registers required for exploiting data reuse
in superword registers and the resulting number of memory accesses, which are the parameters used
to guide the search for the combination of unroll amounts to be applied to the loop nest. The next
subsection describes how the algorithm computes the superword footprint, which represents the
number of superwords accessed by the unrolled iterations of the loop nest as a function of the unroll
factors. Subsection 4.2 presents the computation of the extra registers needed for reusing data across
loop iterations. The total number of registers and the corresponding number of memory accesses
are computed in subsection 4.3.

4.1 Computing the Superword Footprint

This section presents the computation of the superword footprint of the references V in a loop nest,
$F_L(V)$, after unroll-and-jam is applied to the nest with unroll factors $(X_1, X_2, \ldots, X_n)$.

The algorithm for computing the superword footprint for a loop nest first partitions the
references in the loop into groups of uniformly generated references [18], that is, references to the same
array such that, for each array dimension, the array subscripts differ only by a constant term¹. Then,
for each group of references, it computes the number of superwords accessed in the unrolled loop
body. Finally, the total number of superwords is computed as the sum of those of each group of
uniformly generated references.

We first discuss how to compute the superword footprint of a single reference as a function of
the unroll factors of each unrolled loop. Then we discuss how to compute the superword footprint
of a group of uniformly generated references. The superword footprint of a group may be smaller
than the sum of the individual footprints, since the same superword may be accessed by two or more
copies of the original references when the loops are unrolled.

Our method determines the number of superword registers required to hold the data accessed by
the loop references in the unrolled loops. However, extra registers may be needed to, for example,
align a superword operand which is already kept in superword registers. That is, the computation
may require more registers than those needed for storing the data. Therefore, we reserve some
scratch registers for manipulating data and compute the number of registers needed just for storing
the data accessed in the unrolled loops.

To simplify the presentation, we assume a loop nest of depth n where all array references have
array subscripts that are affine functions of a single index variable (SIV subscripts)². We also assume
that each p-dimensional array referenced by the loop is defined as $A[s_p][s_{p-1}] \ldots [s_1]$, where $s_h$ is
the size of dimension h, $1 \leq h \leq p$. Dimension 1 is the lowest dimension of the array, i.e., the
dimension in which consecutive elements are in consecutive memory locations. A reference v to
array A is then of the form $A[a_p * l_p + b_p][a_{p-1} * l_{p-1} + b_{p-1}] \ldots [a_1 * l_1 + b_1]$. Thus, a reference
with SIV subscripts has each array dimension h associated with just a single loop index variable in
the nest, and the loop index variable associated with h is represented as $l_h$. We also assume that the
arrays are aligned to a superword in memory and that the loops are normalized.

1. We assume that two or more references that access the same array, but are not uniformly generated, access distinct
data in memory, which results in a conservative estimate of the number of superwords accessed by the group and of
the number of registers required.

2. Our current implementation can handle affine SIV subscripts and certain affine MIV subscripts.

(a) h = 1 and $a_h < sws$   (b) h > 1

Figure 3: Superword footprint of a single reference.

4.1.1 Superword Footprint of a Single Reference

For each reference v with array subscripts $a_h * l_h + b_h$, where h is the array dimension and $l_h$ is the
loop index variable appearing in subscript h, the number of superwords accessed by all copies of v
when $l_h$ is unrolled by $X_{l_h}$ is given by the superword footprint of v in $l_h$, or $F_{l_h}(v)$.

When dimension h is the lowest array dimension (h = 1), the superword footprint is given by
Equation (1). Equation (1a) corresponds to the footprint of a loop-invariant reference, Equation (1b)
corresponds to the footprint of a reference with self-spatial reuse within a superword, as illustrated
in Figure 3(a), and (1c) holds when the reference has no spatial reuse.

$$
F_{l_h}(v) =
\begin{cases}
1 & \text{(a) if } a_h = 0 \\
\lceil a_h * X_{l_h} / sws \rceil & \text{(b) if } a_h < sws \\
X_{l_h} & \text{(c) if } a_h \geq sws
\end{cases}
\tag{1}
$$
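The three cases for the lowest dimension can be written as a small function. This is a sketch under stated assumptions: the function name is hypothetical, and the ceiling expression used for the self-spatial-reuse case is the form that agrees with the group formulas later in this section when the two constant terms are equal.

```python
def footprint_lowest_dim(a, x, sws):
    """Superwords touched by the x unrolled copies of one reference.

    a   : coefficient of the loop index in the lowest-dimension subscript
    x   : unroll factor of the loop that indexes the lowest dimension
    sws : superword size in elements
    Assumes the array is superword aligned and a >= 0.
    """
    if a == 0:
        return 1                    # loop-invariant: all copies share one superword
    if a < sws:
        return -(-(a * x) // sws)   # ceil(a * x / sws): copies share superwords
    return x                        # stride >= sws: every copy has its own superword


print(footprint_lowest_dim(1, 4, 4))  # unit stride, unroll 4, sws 4
print(footprint_lowest_dim(4, 3, 4))  # stride 4, no spatial reuse
```

With a unit stride and an unroll factor equal to the superword size, the four copies collapse into a single superword, which is the source of the register savings relative to scalar replacement.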

When h is one of the higher dimensions, $1 < h \leq p$, and loop $l_h$ is unrolled, the offset between
the footprints of each copy of v is $a_h * \prod_{i=1}^{h-1} s_i$, where $s_i$ is the size of array dimension i,
as shown in Figure 3(b). Assuming that the size of the lowest array dimension ($s_1$) is larger than
sws, which is usually the case in practice for realistic array dimensions, each copy of v in the
unrolled loop body corresponds to a separate footprint, as shown in Figure 3(b). Therefore the size
of the footprint of v in $l_h$ is the sum of the disjoint footprints, and is recursively defined by
Equation (2), where $F_{l_1}(v)$ is computed as in Equation (1).

$$
F_{l_h}(v) = \prod_{i=2}^{h} X_{l_i} * F_{l_1}(v)
\tag{2}
$$

$$
F_{l_h}(v_1, v_2) =
\begin{cases}
X_{l_h} + (b_2 - b_1)/a_h & \text{(a) if } a_h \geq sws \text{ and } (b_2 - b_1) < a_h * X_{l_h} \text{ and } (b_2 - b_1) \bmod a_h = 0 \\
\lceil (a_h * X_{l_h} + b_2 - b_1)/sws \rceil & \text{(b) if } a_h < sws \text{ and } (b_2 - b_1) < a_h * X_{l_h} \\
F_{l_h}(v_1) + F_{l_h}(v_2) & \text{(c) otherwise}
\end{cases}
\tag{3}
$$

Figure 4: Superword footprint of a group of references.


For a single reference, the number of superword registers required to keep the superword
footprint given by Equation (1) and the number of scalar registers that would be required if the same
unroll factors were used differ only when $a_h < sws$, that is, when spatial reuse can be exploited in
superword registers. For a group of uniformly generated references the analysis must also consider
group reuse, as discussed next.
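Putting the lowest and higher dimensions together, the footprint of one reference in a multi-dimensional array can be sketched as below. The names are hypothetical, and two points are assumptions of this sketch: the lowest-dimension cases use a ceiling form for the self-spatial-reuse case, and a higher dimension whose subscript does not vary with its loop (coefficient 0) contributes a factor of 1 rather than its unroll factor.

```python
def ceil_div(n, d):
    return -(-n // d)


def lowest_dim_footprint(a, x, sws):
    # Lowest dimension: invariant, self-spatial reuse, or no spatial reuse.
    if a == 0:
        return 1
    if a < sws:
        return ceil_div(a * x, sws)
    return x


def reference_footprint(coeffs, unrolls, sws):
    """coeffs[0] and unrolls[0] describe the lowest dimension; later entries
    describe successively higher dimensions."""
    total = lowest_dim_footprint(coeffs[0], unrolls[0], sws)
    for a, x in zip(coeffs[1:], unrolls[1:]):
        # Each unrolled copy in a higher dimension lands in a different row,
        # so its footprint is disjoint from the others and the sizes add up.
        total *= x if a != 0 else 1
    return total


# A[i][j] with j unrolled by 4 and i unrolled by 2, sws = 4
print(reference_footprint([1, 1], [4, 2], 4))
```

The multiplication relies on the assumption stated in the text that the lowest dimension is larger than a superword, so copies from different rows never share a superword.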

4.1.2 Superword Footprint of a Group of References

The number of superwords accessed by a group of uniformly generated references $V = \{v_1, v_2, \ldots, v_m\}$
when loop $l_h$ is unrolled by $X_{l_h}$ is the superword footprint of the group, $F_{l_h}(V)$. The superword
footprint of a group consists of the union of the footprints of the individual references, as some of
the reference footprints may overlap, depending on the distance between the constant terms in the
array subscripts.

The footprints of two uniformly generated references may overlap in dimension h only if they
overlap in all dimensions higher than h. For example, the footprints of references A[2i][j+2]
and A[2i+1][j] do not overlap in the highest (row) dimension, since the first reference accesses the
even-numbered rows of the array and the second accesses the odd-numbered rows. Therefore the
footprints cannot overlap in the lowest (column) dimension. On the other hand, the footprints of
A[2i][j+2] and A[2i+4][j] overlap in the row dimension for iterations $i_1$, $i_2$, $1 \leq i_1, i_2 \leq X_i$,
such that $2i_1 = 2i_2 + 4$. For the iterations of i in which the footprints overlap in the row dimension,
the footprints may overlap in the column dimension if there exist iterations $1 \leq j_1, j_2 \leq X_j$
such that $j_1 + 2 = j_2$.

The superword footprint $F_L(V)$ of a group V, following unroll-and-jam, is computed as
follows. First, the array dimensions with array subscripts that are a function of any of the unrolled
loops are identified. Then, for each such dimension h, from highest to lowest dimension, the
footprint is computed assuming that the footprints of the references in the group overlap in the higher
dimensions. For each dimension h > 1, the algorithm partitions references into subsets such that
each subset corresponds to a disjoint footprint in dimension h. Then, for each subset, the algorithm
recursively computes the footprint in dimension h - 1, as we now describe.

Dimension h is the lowest dimension (h = 1). We first compute the group footprint of two array
references, and then we extend it for m references. The footprint of group $V = \{v_1, v_2\}$, where
references $v_1$ and $v_2$ have lowest dimension subscripts $a_h * l_h + b_1$ and $a_h * l_h + b_2$, $b_1 \leq b_2$,
when loop $l_h$ is unrolled by $X_{l_h}$, is given by Equation (3) in Figure 4. Equations (3a) and (3b) apply
when the two footprints overlap, that is, when $(b_2 - b_1) < a_h * X_{l_h}$, as shown in Figures 4(a) and
(b). When the footprints do not overlap, the group footprint is the sum of the individual footprints,
as in Equation (3c), with examples in Figures 4(c) and (d).

In Figure 4(a), the references have no self-spatial reuse, that is, $a_h \geq sws$, and each individual
footprint is a set of $X_{l_h}$ superwords. The footprints overlap if $(b_2 - b_1)$ is evenly divided by $a_h$ and
there exists an integer value k, $1 \leq k \leq X_{l_h}$, such that $k = 1 + (b_2 - b_1)/a_h$. This case corresponds
to Equation (3a), which computes the group footprint precisely when the two references have
group-temporal reuse. In Figure 4(b), both references have self-spatial reuse within a superword,
that is, $a_h < sws$. The corresponding footprint size is given by Equation (3b). In Figure 4(c), $v_1$ has
no self-spatial reuse and each copy of $v_1$ in the unrolled loop body accesses a distinct superword,
and the same is true for $v_2$. In Figure 4(d) both $v_1$ and $v_2$ have self-spatial reuse.
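The two-reference case for the lowest dimension can be sketched as follows. The function names are hypothetical, the sketch requires a positive coefficient, and it follows the case analysis described above: an exact count when the references have group-temporal reuse, a contiguous-range count when both have self-spatial reuse, and the sum of the individual footprints otherwise.

```python
def ceil_div(n, d):
    return -(-n // d)


def footprint_single(a, x, sws):
    # Individual footprint in the lowest dimension for a >= 1.
    return ceil_div(a * x, sws) if a < sws else x


def footprint_pair_lowest(a, b1, b2, x, sws):
    """Footprint of two uniformly generated references a*l + b1 and a*l + b2."""
    if a < 1 or b1 > b2:
        raise ValueError("sketch assumes a >= 1 and b1 <= b2")
    gap = b2 - b1
    overlap = gap < a * x
    if a >= sws and overlap and gap % a == 0:
        return x + gap // a                   # group-temporal reuse, case (a)
    if a < sws and overlap:
        return ceil_div(a * x + gap, sws)     # one contiguous range, case (b)
    # Otherwise treat the footprints as disjoint (conservative), case (c).
    return 2 * footprint_single(a, x, sws)


print(footprint_pair_lowest(4, 0, 8, 4, 4))   # A[4i] and A[4i+8], unroll 4
print(footprint_pair_lowest(1, 0, 2, 4, 4))   # A[i] and A[i+2], unroll 4
```

In the first call the copies of the two references touch superwords starting at elements 0, 4, 8, 12 and 8, 12, 16, 20, so the union holds six superwords rather than eight.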

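The footprint of a group $V = \{v_1, v_2, \ldots, v_m\}$, with array subscripts $a_1 * l_1 + b_i$ such that
$1 \leq i \leq m$ and $b_1 \leq b_2 \leq \ldots \leq b_m$, is computed by first partitioning V into subgroups with
disjoint footprints in the lowest dimension, as follows. A subgroup $V_i = \{v_{i_{min}}, \ldots, v_{i_{max}}\}$
is defined by lowest dimension subscripts $a_1 * l_1 + b_j$, where $\forall j$, $i_{min} < j \leq i_{max}$:

$$
\begin{aligned}
& (b_{j-1} \leq b_j) \;\wedge \\
& (b_j - b_{j-1} < a_1 * X_{l_1}) \;\wedge \\
& (b_{i_{min}} = b_1 \;\vee\; b_{i_{min}} - b_{i_{min}-1} \geq a_1 * X_{l_1}) \;\wedge \\
& (b_{i_{max}} = b_m \;\vee\; b_{i_{max}+1} - b_{i_{max}} \geq a_1 * X_{l_1})
\end{aligned}
\tag{4}
$$

Then the group footprint $F_{l_1}(V)$ is computed as the sum of the disjoint footprints of sets $V_i$, as in
(5).

$$
F_{l_1}(V) = \sum_i F_{l_1}(V_i)
\tag{5}
$$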

The footprint of each subgroup $V_i$ is computed by extending Equation (3) to m > 2 references.
For example, when the references in V have self-spatial reuse, as in Equation (3b) ($a_1 < sws$), each
subgroup $V_i$ has a footprint consisting of contiguous superwords, since $b_j - b_{j-1} < a_1 * X_{l_1}$ for all
j such that $i_{min} < j \leq i_{max}$. The footprint of $V_i$ consists of the union of the individual footprints,
with size given by Equation (6).

$$
F_{l_1}(V_i) = F_{l_1}(\{v_{i_{min}}, \ldots, v_{i_{max}}\})
= \left\lceil \frac{a_1 * X_{l_1} + b_{i_{max}} - b_{i_{min}}}{sws} \right\rceil
\tag{6}
$$

For example, if sws = 4 and X = 4, group V = {A[i], A[i+2], A[i+5], A[i+12], A[i+14]}
can be partitioned into two subgroups $V_1$ = {A[i], A[i+2], A[i+5]} and $V_2$ = {A[i+12], A[i+14]}
with disjoint superword footprints. Since the references have self-spatial reuse, each individual
footprint and the footprint of each subgroup is a set of contiguous superwords. The total number of
superwords accessed by the references in V is the sum of the disjoint footprints of sets $V_1$ and $V_2$,
as in (7).

$$
F_{l_1}(V) = F_{l_1}(V_1) + F_{l_1}(V_2)
= \left\lceil \frac{1 * 4 + 5 - 0}{4} \right\rceil + \left\lceil \frac{1 * 4 + 14 - 12}{4} \right\rceil
\tag{7}
$$
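The partitioning and summation used in this example can be reproduced with a short script. This is a sketch for the self-spatial-reuse case only (coefficient between 1 and sws - 1); the function names are hypothetical and references are represented just by their constant terms.

```python
def ceil_div(n, d):
    return -(-n // d)


def partition_lowest(offsets, a, x):
    """Split sorted constant terms into subgroups whose footprints are disjoint.

    Neighbouring references stay in the same subgroup while the gap between
    their constant terms is smaller than a * x.
    """
    offsets = sorted(offsets)
    groups = [[offsets[0]]]
    for prev, cur in zip(offsets, offsets[1:]):
        if cur - prev < a * x:
            groups[-1].append(cur)
        else:
            groups.append([cur])
    return groups


def group_footprint(offsets, a, x, sws):
    """Sum of the contiguous footprints of the subgroups (valid for 0 < a < sws)."""
    return sum(ceil_div(a * x + g[-1] - g[0], sws)
               for g in partition_lowest(offsets, a, x))


# A[i], A[i+2], A[i+5], A[i+12], A[i+14] with sws = 4 and unroll factor 4
print(partition_lowest([0, 2, 5, 12, 14], 1, 4))
print(group_footprint([0, 2, 5, 12, 14], 1, 4, 4))
```

The first subgroup spans elements i through i+8 (three superwords) and the second spans i+12 through i+17 (two superwords), so the group needs five superword registers instead of the twenty scalar registers its unrolled copies would otherwise occupy.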


Dimension h is not the lowest dimension (h > 1). When h is one of the higher dimensions, the
superword footprint of $V = \{v_1, v_2, \ldots, v_m\}$ in loop $l_h$ is again the union of the individual footprints.

From Section 4.1.1, the footprint of each reference $v_j$ in the unrolled loop body consists of a
set of $X_{l_h}$ disjoint footprints (each footprint corresponding to a copy of $v_j$ created by unrolling),
and the offset between each pair of consecutive footprints is $a_h * \prod_{i=1}^{h-1} s_i$, where $s_i$ is the size of
dimension i.

Therefore the footprints of different references in the group may overlap, depending on the
values of $a_h$, $b_j$ and the unroll factor $X_{l_h}$. The footprints of two uniformly generated references
$v_1$ and $v_2$ overlap in dimension h if there exists an integer value k, $1 \leq k \leq X_{l_h}$, that satisfies
Condition (8):

$$
a_h * k + b_1 = a_h + b_2,
\tag{8}
$$

that is, if $(b_2 - b_1) \bmod a_h = 0$ and $(b_2 - b_1)/a_h + 1 \leq X_{l_h}$. Furthermore, if there exists k satisfying
the above condition, the footprints of the last $X_{l_h} - k + 1$ copies of $v_1$ in the unrolled loop body
overlap with those of the first $X_{l_h} - k + 1$ copies of $v_2$. The footprint of $\{v_1, v_2\}$ is then given by
Equation (9).

$$
F_{l_h}(v_1, v_2) = (k - 1) * F_{l_{h-1}}(v_1) + (X_{l_h} - k + 1) * F_{l_{h-1}}(v_1, v_2) + (k - 1) * F_{l_{h-1}}(v_2)
\tag{9}
$$
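The higher-dimension case for two references can be sketched as follows. The function names are hypothetical, the lower-dimension footprints are passed in as already-computed numbers, and the non-overlapping branch (every copy of both references counted separately) is an assumption of this sketch drawn from the statement that each reference contributes one disjoint footprint per unrolled copy.

```python
def overlap_index(a, b1, b2, x):
    """Return k with a*k + b1 == a + b2 and 1 <= k <= x, or None if no such k."""
    gap = b2 - b1
    if a > 0 and gap >= 0 and gap % a == 0:
        k = gap // a + 1
        if k <= x:
            return k
    return None


def footprint_pair_higher(a, b1, b2, x, f_v1, f_v2, f_both):
    """Footprint of two references in a higher dimension h.

    f_v1, f_v2 : footprint of each reference alone in dimension h - 1
    f_both     : footprint of the pair in dimension h - 1 (where rows coincide)
    """
    k = overlap_index(a, b1, b2, x)
    if k is None:
        return x * (f_v1 + f_v2)   # no shared rows: all copies are disjoint
    # k - 1 rows touched only by v1, x - k + 1 shared rows, k - 1 rows only by v2.
    return (k - 1) * f_v1 + (x - k + 1) * f_both + (k - 1) * f_v2


# A[i][j] and A[i+1][j], i unrolled by 4, identical column footprints of size 1
print(footprint_pair_higher(1, 0, 1, 4, 1, 1, 1))
```

In the printed example the first reference touches rows i to i+3 and the second rows i+1 to i+4, so three rows are shared and five distinct superwords are needed.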

To compute the size of the entire footprint of V in $l_h$, our algorithm partitions V into subsets
$V_i = \{v_{i_{min}}, \ldots, v_{i_{max}}\}$ such that for any j, $i_{min} < j \leq i_{max}$, the pair $\{v_{j-1}, v_j\}$ satisfies
Condition (8). The footprint of $V_i$ is the union of the footprints of its reference set and is computed by
extending Equation (9) to more than two references.

4.2 Registers for Reuse Across Iterations

In addition to superword registers for exploiting reuse in the body of the transformed loop nest, extra
superword registers may be required for exploiting reuse across iterations of the innermost loop for
references with group-temporal reuse carried by the innermost loop n of the transformed loop nest.

To compute the number of registers needed to exploit group-temporal reuse across iterations
of loop n, the algorithm examines groups of references that have consistent dependences carried
by n.³ Assume that unroll-and-jam has been applied to outer loops in a nest. After subsequently
unrolling the innermost loop, extra registers are required if the reuse distance between references
prior to unrolling loop n is larger than the unroll amount, i.e., if $d_n > X_n$, as in Figure 2, where
$d_n = 8$ and $X_n = 4$.


Let $C = \{v_1, v_2, \ldots, v_m\}$ be a set of references that is a subset of a uniformly generated set, and,
prior to unrolling the innermost loop resulting from unroll-and-jam by $X_n$, each pair $(v_i, v_{i+1})$ in C
has a consistent dependence $d^i = (0, 0, \ldots, d^i_n)$, $d^i_n > 0$. Also, assume that the array subscript of the
lowest dimension of each reference $v_i$ in C is of the form $a_1 * l_1 + b_i$, and that $b_1 \leq b_2 \leq \ldots \leq b_m$.
Unrolling loop n generates $X_n$ copies of each original reference in the body of the transformed
loop nest.

When $d^i_n$ is a multiple of the unroll factor $X_n$, each pair of copies of references $(v_i, v_{i+1})$ will
reuse data after $d^i_n / X_n$ iterations. When $d^i_n$ is not a multiple of $X_n$, some copies of a reference will
reuse data after $\lceil d^i_n / X_n \rceil - 1$ iterations of n, while others will have a reuse distance of
$\lceil d^i_n / X_n \rceil$, requiring one more register per copy. Thus, each pair of copies of references
$(v_i, v_{i+1})$ requires at most $\lceil d^i_n / X_n \rceil - 1$ additional superword registers to keep the data
across iterations of the innermost loop.

The number of registers required to exploit reuse across iterations of n by all pairs of copies
is the number of registers required for each pair times the number of registers required to keep the
superword footprint of reference $v_i$ in the transformed loop nest:

$$
\left( \left\lceil \frac{d^i_n}{X_n} \right\rceil - 1 \right) \times F_L(v_i)
\tag{10}
$$

Equation (10) may overestimate the number of registers if the footprint component ($F_L(v_i)$)
overestimates registers, or for certain copies of references if $d^i_n$ is not a multiple of $X_n$.

The total number of registers required for exploiting reuse across iterations for set C with
leading reference $v_1$ is given by:

$$
R_A(C) = \sum_{1 \leq i < m} \left( \left\lceil \frac{d^i_n}{X_n} \right\rceil - 1 \right) \times F_L(v_i)
\tag{11}
$$

4.3 Putting It All Together

Subsections 4.1 and 4.2 describe the computation of the number of registers required to exploit
reuse in the body of the innermost loop (superword footprint) and across iterations of the
innermost loop, assuming that unroll-and-jam has been applied to the loop nest. This section presents the
computation of the total number of registers required and the total number of memory accesses in
the innermost loop of the transformed loop nest, which are the metrics used to prune and guide the
search for unroll factors described in Section 5.

The total number of registers required to exploit reuse is the sum of the superword footprint of
the references in the innermost loop of the transformed loop nest and the number of registers needed
for exploiting reuse across iterations of the same innermost loop.

The superword footprint of the references, $F_L(V)$, is computed as in subsection 4.1. The total
number of extra registers required for exploiting reuse across iterations of the innermost loop is
computed as in subsection 4.2, for each set C of loop-variant references with consistent dependences
carried by the innermost loop.

3. Note that such references, if their lowest dimension varies with n, may also have group-spatial reuse across loop
iterations. However, our algorithm focuses on exploiting group-temporal reuse across iterations, since most of the
group-spatial reuse is achieved within the body of the unrolled loop.

The total number of superword registers required is then:

$$
R(V) = F_L(V) + \sum_C R_A(C)
\tag{12}
$$

The total number of memory accesses in the innermost loop of the transformed loop nest is the
sum of the memory accesses of each group C of references that are variant with the innermost loop
n and have consistent dependences carried by n. For each group C, the number of memory accesses
is given by the superword footprint of the leading reference of the group, $v^C_1$:

$$
M(C) = F_L(v^C_1)
\tag{13}
$$

The total number of memory accesses is then:

$$
M(V) = \sum_C M(C)
\tag{14}
$$

5.  Search  Algorithm 

As previously stated, the goal of the search algorithm is to identify the unroll factors for the loops in
the loop nest such that the number of memory accesses is minimized, without exceeding available
registers. Thus, we must consider an n-dimensional search space, where each dimension has the
number of elements corresponding to the iteration count of the loop. A full global search of this
search space is prohibitively expensive, especially for deep loop nests or large loop bounds. Thus,
we use a number of strategies for pruning the search space.

First, we eliminate from the search loops that do not carry reuse or for which unroll-and-jam is
not safe. Further, we rely on the observation that the number of registers required monotonically
increases with the unroll factor of a loop, assuming that all other unroll factors are fixed. Thus,
we need not search beyond the unroll factors that exceed available registers. This latter point
significantly prunes the search space in that the number of registers is usually fairly small (e.g., 32
superword registers on the AltiVec), so that the search is concentrated on fairly small unroll factors.
These pruning strategies are used in our current implementation, and at least for the programs in
this study, are quite effective at making the search practical.

Further pruning is possible by making the additional observation that for each unrolled loop
l, the amount of reuse of an array reference with reuse carried by l increases with the unroll
factor $X_l$. Therefore reuse, like the register requirement calculation, is a monotonic, non-decreasing
function of the unroll factor for each loop, given that the unroll factor of all other loops is fixed.
Thus, within each dimension, holding all other unroll factors constant, binary search can be used
rather than searching all points. We can also increase unroll factors by amounts corresponding to
the superword size without much loss of precision, rather than considering each possible unroll
factor, since the register requirements increase stepwise as a function of superword size. Additional
pruning techniques that take into account the hardware's capability to take advantage of the results
of optimization have been used in prior work [10, 24].

Our implementation navigates the search space from innermost loop to outermost loop, for the
applicable loops in the nest, varying the unroll factor of one loop while keeping the unroll factors of
all other loops fixed. Within a dimension of the search space, the lowest number of memory accesses
will be derived at the largest unroll factor that meets the register constraint. However, lower unroll
factors may also have the same estimate of memory accesses (because reuse is monotonically
non-decreasing), so we identify the lowest unroll factor with the equivalent estimate of memory accesses.
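Because the register requirement grows monotonically along one dimension of the search space, the largest feasible unroll factor in that dimension can be located by binary search. The sketch below is illustrative rather than the implementation described here: `registers_needed` is a hypothetical callable standing in for the register model of Section 4, evaluated with all other unroll factors held fixed.

```python
def largest_feasible_unroll(registers_needed, max_unroll, budget):
    """Largest unroll factor in [1, max_unroll] whose register need fits the budget.

    registers_needed must be non-decreasing in the unroll factor.
    Returns None if even an unroll factor of 1 does not fit.
    """
    lo, hi, best = 1, max_unroll, None
    while lo <= hi:
        mid = (lo + hi) // 2
        if registers_needed(mid) <= budget:
            best = mid       # feasible: try larger factors
            lo = mid + 1
        else:
            hi = mid - 1     # infeasible: larger factors are infeasible too
    return best


# Toy register model (an assumption): 3 registers per unrolled copy plus 2 scratch.
print(largest_feasible_unroll(lambda x: 3 * x + 2, 64, 32))
```

With the toy model and a budget of 32 registers the search stops at an unroll factor of 10 after a handful of probes instead of evaluating all 64 candidates.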

Then, the implementation considers the next applicable outer loop and the applicable inner loops
nested inside it, and in a particular dimension, each time it reaches the largest unroll factor that
meets the register constraint, it compares the estimated number of memory accesses to the lowest
estimate so far to determine if a better solution has been found. The final result of the algorithm is
the unroll factors corresponding to the best solution.
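The overall search can be illustrated with a small pruned enumeration. This is a sketch under stated assumptions and not the traversal order of the implementation described above: it walks loops in the order given, cuts a dimension off as soon as the register model exceeds the budget (relying on monotonicity), and keeps the first, smallest factors among candidates with equal cost. The `registers` and `accesses` callables and the toy models below are hypothetical stand-ins for the analysis of Section 4.

```python
from fractions import Fraction


def ceil_div(n, d):
    return -(-n // d)


def search_unroll_factors(bounds, registers, accesses, budget):
    """Return (cost, factors) minimizing accesses subject to the register budget."""
    best = None

    def visit(prefix, depth):
        nonlocal best
        if depth == len(bounds):
            cost = accesses(prefix)
            if best is None or cost < best[0]:   # strict: keep smaller factors on ties
                best = (cost, prefix)
            return
        for x in range(1, bounds[depth] + 1):
            cand = prefix + (x,)
            # Remaining loops at factor 1 give a lower bound on register need.
            probe = cand + (1,) * (len(bounds) - depth - 1)
            if registers(probe) > budget:
                break                            # larger x can only need more registers
            visit(cand, depth + 1)

    visit((), 0)
    return best


# Toy models for a two-deep nest (outer factor f[0], inner factor f[1], sws = 4).
def toy_registers(f):
    return f[0] * ceil_div(f[1], 4) + 1


def toy_accesses(f):
    # Superword accesses per original iteration, kept exact with Fraction.
    return Fraction((f[0] + 1) * ceil_div(f[1], 4), f[0] * f[1])


print(search_unroll_factors((4, 8), toy_registers, toy_accesses, 8))
```

Under the toy models the search settles on unrolling both loops by 4; unrolling the inner loop further would need a second superword per outer copy and exceed the budget of 8 registers.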

As a subtle point, when unroll-and-jam is applied from outermost to innermost loop, unrolling
the inner loop does not affect data access patterns or reuse distance. For this reason, inner loop
unrolling is not performed in earlier work [19]. In our context, however, because of the relationship
between superword-level parallelism and superword replacement, inner loop unrolling exposes
opportunities for superword loads and stores and thus can impact the analysis of register requirements.
Nevertheless, when reuse is exploited across iterations of the innermost loop body as described in
Section 4.2, it is not necessary to unroll the innermost loop beyond the superword size to achieve
the goal of considering register requirements in conjunction with superword-level parallelism. Note,
however, that smaller unroll factors for the innermost loop may be selected, if an unroll-and-jam of
an outer loop carries more parallelism and reuse.

Although this search should theoretically find the optimal solution, according to our optimization
criteria, in fact the solution is not guaranteed to result in the fewest number of memory accesses,
for a number of reasons. First, in a few cases as noted, the register requirement analysis defined in
the previous section must conservatively approximate. Second, it is difficult to estimate the register
requirements used to hold temporaries, so we conservatively approximate this as well. Third, there is
a tradeoff between using extra registers to hold values across iterations, as discussed in Section 4.2,
versus using them to actually exploit reuse within the transformed innermost loop body. In fact, in
general the algorithm does not take into consideration the amount of reuse resulting from performing
superword replacement on specific references; replacing some references has more impact on
decreasing memory accesses than others.

This section and the previous one have described how the compiler analyzes the code to identify
reuse, register requirements and the unroll factors leading towards the lowest number of memory
accesses. In the next section, we describe how these analyses are used in transforming the code to
achieve the desired result.


6. Code Generation

In the previous section, we showed how consideration of superwords instead of scalar variables
greatly increases the complexity of determining the number of registers and memory accesses
associated with exploiting reuse under different unroll amounts. In this section, we further discuss the
increased complexity of code generation when performing superword replacement instead of scalar
replacement. The chief source of code generation complexity is the need for superword objects to
be properly aligned, as in the following examples.

When performing memory operations, the architecture may actually require that an access be
aligned at superword boundaries. For example, the AltiVec ignores the last four bits of an address
when performing a superword load or store. In such an architecture, when an access is not aligned
at a superword boundary, the compiler or programmer must read/write two adjacent superwords. A
series of additional instructions packs the two superwords for reads or unpacks a superword into its
corresponding two superwords for writes. Even on architectures that support memory accesses not
aligned at superword boundaries, such as Intel's SSE, there is a performance penalty on unaligned
accesses because the hardware must perform this realignment.
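
To make the two-superword read concrete, here is a minimal scalar emulation in C. It is a sketch under assumptions rather than real AltiVec code: a superword is modeled as four 32-bit floats, `mem` is assumed to start at a superword boundary, addresses are expressed as element indices, and the names `superword`, `load_aligned` and `load_unaligned` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SWS 4  /* fields per superword: four 32-bit floats in 128 bits */

typedef struct { float f[SWS]; } superword;

/* Aligned load: as on the hardware described above, the low-order bits of
 * the (element) address are ignored, so the access starts at a boundary. */
superword load_aligned(const float *mem, size_t idx)
{
    assert(mem != NULL);
    superword v;
    size_t base = idx & ~(size_t)(SWS - 1);
    for (int k = 0; k < SWS; k++)
        v.f[k] = mem[base + k];
    return v;
}

/* Unaligned load built from two adjacent aligned loads plus a packing step. */
superword load_unaligned(const float *mem, size_t idx)
{
    size_t shift = idx & (SWS - 1);
    superword lo = load_aligned(mem, idx);
    if (shift == 0)
        return lo;                        /* already aligned: one load */
    superword hi = load_aligned(mem, idx + SWS);
    superword out;
    for (size_t k = 0; k < SWS; k++)      /* pack the two superwords */
        out.f[k] = (k + shift < SWS) ? lo.f[k + shift]
                                     : hi.f[k + shift - SWS];
    return out;
}
```

The extra load and the packing loop in `load_unaligned` stand in for the additional instructions that an unaligned access costs.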

To perform an arithmetic or logical operation on two superword registers, the fields of the two
operands must also be aligned. For example, to add the third and fourth fields of one superword
register to the first and second fields of another, one of the registers must be shifted by two fields.
Consider also the following example:

    for i = 1, n
        c[i] = a[2i] + b[i]

The access to a has a stride of 2, while the access to b has a unit stride. Thus, the compiler or
programmer must first pack the even elements of a into a superword register before adding them to
the elements of b. A third example occurs when exploiting partial reuse of a superword where data
in a register must be aligned to accommodate the next operation.
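
The strided example can be sketched in scalar C to show where the packing step sits. This is an illustrative emulation only: it uses 0-based indexing instead of the 1-based pseudocode, models a superword as four floats, and the names `pack_stride2` and `add_stride2` are hypothetical.

```c
#include <assert.h>

#define SWS 4  /* fields per superword */

typedef struct { float f[SWS]; } superword;

/* Packing step: gather a[2*i], a[2*(i+1)], ... into one superword. */
superword pack_stride2(const float *a, int i)
{
    superword v;
    for (int k = 0; k < SWS; k++)
        v.f[k] = a[2 * (i + k)];
    return v;
}

/* c[i] = a[2i] + b[i], processed one superword at a time. */
void add_stride2(float *c, const float *a, const float *b, int n)
{
    assert(n >= 0);
    int i = 0;
    for (; i + SWS <= n; i += SWS) {
        superword va = pack_stride2(a, i);   /* stride-2 operand is packed */
        for (int k = 0; k < SWS; k++)        /* stands for one superword add */
            c[i + k] = va.f[k] + b[i + k];   /* unit-stride operand as-is */
    }
    for (; i < n; i++)                       /* scalar remainder */
        c[i] = a[2 * i] + b[i];
}
```

Only the access to `a` needs packing; the unit-stride access to `b` maps directly onto a superword.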

In the SLP compiler, the default solution to alignment involves packing data through memory.
The SLP compiler allocates superword variables by declaring them using a special vector type
designation, which is interpreted by the backend compiler to align the beginning of the variable to a
superword boundary in memory. The start of each dimension of an array of such objects should also
be aligned, by padding if necessary. Under these assumptions, the SLP compiler can detect when
operations are unaligned. Unaligned data is packed into an aligned superword in memory before
being loaded into a superword register, and is unpacked before storing back to memory (see footnote 4).

In summary, alignment is a key consideration in code generation, and the overhead of performing
alignment operations can be quite high. Further, alignment operations may require a number
of additional superword registers, and in some cases, may result in additional accesses to memory
not accounted for by the model in the previous section. In this section, we show how to achieve the
number of registers derived by our model through a set of code transformations, presented in the
order in which they are performed by our compiler. In addition to superword replacement, described
in Section 6.2, we also describe how index set splitting is used to align accesses to the beginning of
an iteration in Section 6.1, and how our compiler eliminates additional memory accesses resulting
from packing through memory for alignment in Section 6.3. We illustrate how these transformations
collaborate with each other by way of an example in Figure 5, which is a simplified FIR filter.

6.1 Index Set Splitting

A simple way to reduce the need for alignment operations, when applicable, is to perform index
set splitting on loops. For example, in Figure 5(b), the initial access to out[1] refers to the
second field of a superword, assuming out[0] is aligned at a superword boundary. Through
index set splitting, the portion of the loop from lines 4-6 will always perform aligned accesses. This
transformation is always safe, and is profitable whenever it increases the number of aligned memory
accesses.

4. For architectures that support copying between scalar and superword register files, such as Intel's SSE and DIVA,
this packing can be performed more efficiently through register copies.

 1)  for (i = 1; i < 64; i++)
 2)      out[i] = 0.0;
 3)
 4)  for (i = 256; i < 320; i++)
 5)      for (j = 0; j < 256; j++)
 6)          out[i-256] = out[i-256] + in[i-j] * coe[j];

(a) Original

 1)  for (i = 1; i < 4; i++) {
 2)      out[i] = 0.0;
 3)  }
 4)  for (i = 4; i < 64; i++) {
 5)      out[i] = 0.0;
 6)  }
 7)  for (i = 256; i < 320; i++) {
 8)      for (j = 0; j < 256; j++) {
 9)          out[i - 256] = out[i - 256] + in[i - j] * coe[j];
10)      }
11)  }

(b) After index set splitting

 1)  for (i = 1; i < 4; i++) {
 2)      out[i] = 0.0;
 3)  }
 4)  for (i = 4; i < 64; i += 4) {
 5)      out[i + 0] = 0.0;
 6)      out[i + 1] = 0.0;
 7)      out[i + 2] = 0.0;
 8)      out[i + 3] = 0.0;
 9)  }
10)  for (i = 256; i < 320; i += 8) {
11)      for (j = 0; j < 256; j += 8) {
12)          out[i + 0 - 256] = out[i + 0 - 256] + in[i + 0 - (j + 0)] * coe[j + 0];
13)          out[i + 0 - 256] = out[i + 0 - 256] + in[i + 0 - (j + 1)] * coe[j + 1];
14)          ...
15)          out[i + 7 - 256] = out[i + 7 - 256] + in[i + 7 - (j + 7)] * coe[j + 7];
16)      }
17)  }

(c) After unroll-and-jam

Figure 5: Code Generation Example

We assume index set splitting is performed prior to the SLP compiler. The loop is transformed so
that accesses corresponding to a particular reference in the main loop body are aligned to superword
boundaries. If there are multiple references and different choices for index set splitting are needed
to align specific references, we select a representative reference that, if aligned through index set
splitting, will also maximize alignment for other references. The reference selected must have unit
stride within the innermost loop.

 1)  flat1 = *((float *)&vec0 + 3);
 2)  flat2 = *((float *)&vec1 + 0);
 3)  flat3 = *((float *)&vec1 + 1);
 4)  flat4 = *((float *)&vec1 + 2);
 5)  *((float *)&vec2 + 0) = flat1;
 6)  *((float *)&vec2 + 1) = flat2;
 7)  *((float *)&vec2 + 2) = flat3;
 8)  *((float *)&vec2 + 3) = flat4;
 9)  vec4 = vec_add(vec3, vec2);
10)  vec_st(vec4, i * 4 + 0, (float *)&out[-63]);
11)  vec5 = vec_ld(i * 4, (float *)&out[-63]);
12)  flat5 = *((float *)&vec6 + 2);
13)  flat6 = *((float *)&vec7 + 2);
14)  *((float *)&vec8 + 0) = flat5;
15)  *((float *)&vec8 + 1) = flat6;

(d) After SLP compilation

 1)  flat1 = *((float *)&vec0 + 3);
 2)  flat2 = *((float *)&vec1 + 0);
 3)  flat3 = *((float *)&vec1 + 1);
 4)  flat4 = *((float *)&vec1 + 2);
 5)  *((float *)&vec2 + 0) = flat1;
 6)  *((float *)&vec2 + 1) = flat2;
 7)  *((float *)&vec2 + 2) = flat3;
 8)  *((float *)&vec2 + 3) = flat4;
 9)  vec4 = vec_add(vec3, vec2);
10)  flat5 = *((float *)&vec6 + 2);
11)  flat6 = *((float *)&vec7 + 2);
12)  *((float *)&vec8 + 0) = flat5;
13)  *((float *)&vec8 + 1) = flat6;

(e) After superword replacement

 1)  temp1 = replicate(vec0, 3);
 2)  temp2 = replicate(vec1, 0);
 3)  temp3 = replicate(vec1, 1);
 4)  temp4 = replicate(vec1, 2);
 5)  vec2 = shift_and_load(temp1, temp1, 4);
 6)  vec2 = shift_and_load(vec2, temp2, 4);
 7)  vec2 = shift_and_load(vec2, temp3, 4);
 8)  vec2 = shift_and_load(vec2, temp4, 4);
 9)  vec4 = vec_add(vec3, vec2);
10)  temp1 = replicate(vec6, 2);
11)  temp2 = replicate(vec7, 2);
12)  vec8 = shift_and_load(temp1, temp1, 4);
13)  vec8 = shift_and_load(vec8, temp2, 12);

(f) After packing in registers

Figure 5: Code Generation Example (Continued)

Let i be the loop index variable for the innermost loop, and let lb and ub be the lower and upper
bounds for i. To derive the loop bounds for the copies of the innermost loop resulting from index
set splitting, we begin with the starting address, addr, of the reference when i = lb, where addr =
base + offset. Here, base refers to the beginning of the lowest dimension of the selected array, and
offset is the offset within that dimension. (Recall that the beginning of each dimension is aligned at
superword boundaries.)

The lower bound (split) of the main loop body is computed by the following equation, where sws
is the superword size.

\[
\mathit{split} =
\begin{cases}
lb & \text{if } \mathit{offset} \bmod \mathit{sws} = 0 \\
lb + \mathit{sws} - (\mathit{offset} \bmod \mathit{sws}) & \text{if } \mathit{offset} \bmod \mathit{sws} \neq 0
\end{cases}
\]

If lb is constant, split can be computed at compile time. Otherwise, it is computed at run time. In
the example in Figure 5, offset for out[1] is 1, so if sws = 4, then split = 4.
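
The split computation is small enough to sketch directly in C. The function name `compute_split` and its parameter names are hypothetical, and `offset` and `sws` are assumed to be measured in array elements, as in the example.

```c
#include <assert.h>

/* Lower bound of the aligned main loop produced by index set splitting:
 *   split = lb                            if offset mod sws == 0
 *   split = lb + sws - (offset mod sws)   otherwise
 * offset is the offset of the reference at i == lb within its (aligned)
 * array dimension; sws is the superword size in elements. */
int compute_split(int lb, int offset, int sws)
{
    assert(sws > 0 && offset >= 0);
    int rem = offset % sws;
    return rem == 0 ? lb : lb + sws - rem;
}
```

For the loop over out[i] in Figure 5, `compute_split(1, 1, 4)` gives 4, so iterations 1 through 3 are peeled into the first loop and the main loop starts at the aligned element out[4].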
6.2 Superword Replacement

Superword replacement removes redundant loads and stores of superword variables, using
superword temporaries instead. We assume that this code transformation will be followed by register
allocation that places these variables in registers. For example, in Figure 5(d) and (e), the store
and load at statements 10 and 11 can both be eliminated, and vec4 can be used in place of vec5
in subsequent statements. Superword replacement is also affected by alignment, in that we detect
redundant loads and stores by identifying distinct memory operations that refer to the same aligned
superword, even if the addresses are not identical.

The compiler recognizes opportunities for superword replacement by determining that addresses
and offsets for different memory accesses fit within the same superword, and verifies that there are
no intervening kills to the memory locations. The current implementation uses value numbering [25]
to detect such opportunities. Value numbering is a well-known compiler technique for detecting
redundant computation, but it is sensitive to operand and operator ordering. To increase the success
of value numbering, we first preprocess the code so that memory access operations are rewritten into
a canonical form, constant folding has been applied to simplify addresses, and alignment is taken
into account. As earlier stated, all memory accesses are aligned at superword boundaries, so if an
unaligned address appears in a memory access, the resulting access will be aligned to the preceding
superword boundary. The preprocessing performs this alignment in software so that redundant
accesses will be identified by value numbering.
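
The software alignment used to canonicalize addresses can be illustrated with a short C sketch. It assumes 16-byte (128-bit) superwords as on the AltiVec; the helper names `align_down` and `same_superword` are hypothetical and do not come from the compiler described here.

```c
#include <assert.h>
#include <stdint.h>

#define SUPERWORD_BYTES 16u  /* 128-bit superword */

/* Round a byte address down to the preceding superword boundary by
 * clearing its low-order four bits. */
uintptr_t align_down(uintptr_t addr)
{
    /* The mask trick only works for power-of-two sizes. */
    assert((SUPERWORD_BYTES & (SUPERWORD_BYTES - 1u)) == 0u);
    return addr & ~(uintptr_t)(SUPERWORD_BYTES - 1u);
}

/* Two accesses touch the same aligned superword when their aligned
 * addresses agree, even if the raw addresses differ. */
int same_superword(uintptr_t a, uintptr_t b)
{
    return align_down(a) == align_down(b);
}
```

Once every address has been passed through a step like `align_down`, accesses to 0x1004 and 0x100c become textually identical accesses to 0x1000, which is the form in which value numbering can recognize them as redundant.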

The current implementation of superword replacement is more restrictive than what was
presented in Section [?]. Value numbering operates on a basic block at a time so we cannot exploit reuse
across iterations of the unrolled loop body. This is because we are performing this transformation
after the SLP compiler has flattened the loop structure to gotos and labels. The dependence
information used to perform the register requirement analysis cannot easily be reconstructed from such
low-level code. In an implementation where SLP and SLL are more tightly integrated, it should be
possible to perform superword replacement as a byproduct of the analysis in Section [?].

6.3 Packing in Superword Registers

As previously described, packing in memory is performed to align superword objects. Memory
packing moves data elements from a set of locations in memory (sources) to a superword location
(destination) so that the destination superword contains contiguous data, aligned to a superword
boundary or to another operand. For example, in Figure 5(e), superword variables vec0 and vec1
are the sources and superword variable vec2 is the destination for memory packing in lines 1-8.

Our implementation performs a transformation we call register packing to optimize memory
packing operations. A series of memory loads and stores for scalar variables are replaced by
superword operations on registers, as shown in Figure 5(f). We identify a destination as a superword
data type that is the target of a series of scalar store instructions into its fields, such as vec2 in
the example. The corresponding sources are identified by finding preceding loads of these scalar
variables. If the inputs to these loads are fields of superword data types, then these superwords are
the sources. In the example, flat1 is stored into a field of vec2, and there is a preceding load
of flat1 that copies a field of source vec0. Once we find such a pattern, we verify the safety of
this transformation by guaranteeing that there are no intervening modifications or uses of either the
scalar variables or destination superwords between loading the scalar variables and completion of
storing into the destination. We also verify that the destination statements ultimately produce
contiguous data in the superword. We define source and destination indices as the fields in the source
and destination superword variables, respectively. For example, the source index of vec0 is 3 in
line 1 of the example.

(a) temp1 = replicate(a, 0)        (b) p = shift_and_load(p, temp1, 4)

Figure 6: Operations used for packing in registers

Once the compiler identifies sources and destinations, it transforms the code to replace memory
accesses with operations on superword registers. The register packing transformation takes
advantage of two instructions that are common in multimedia extension architectures. Replicate replicates
one element of a source register to all elements of a temporary output register (Figure 6(a)).
Shift-and-load takes two input registers. The first input register is a temporary, and is shifted left by
the number of bytes specified by the third argument. The same number of fields is taken from the
second input register, which is a temporary derived from a source superword, to fill the output
temporary register (Figure 6(b)). Simply stated, we are shifting each source element into the destination
superword, in order, so that the final result is a destination superword that corresponds to contiguous
aligned data.

The  steps  of  the  register  packing  transformation  are  as  follows. 

1. We sort the destination statements in increasing order of their destination indices. We then
sort the source statements to correspond to the ordering of the destination statements, so that,
for example, the scalar variable associated with the first source statement is the same as the
scalar variable associated with the first destination statement.

2. For each source statement, in sorted order, we generate a replicate statement whose two
inputs are the source superword and the source index, and the output is a superword
temporary. For example, as in Figure 5(f), we have replaced line 1 of Figure 5(e) with temp1 =
replicate(vec0, 3).

3. We replace each destination statement, in sorted order, with a shift_and_load operation.
The first input is the destination superword. The second input is the temporary generated
by the replicate of the corresponding source statement. The third argument, the shift
amount, usually involves shifting by a single superword field. For the last destination field,
the shift amount is the difference, in bytes, between the sws and the last destination field. For
completely filled destination superwords, it will also be just a single field. For example, in
lines 1-8 of Figure 5(e), the destination superword is completely filled, so the shift amount is
always a single 4-byte field. In lines 10-15, however, only the first two fields are filled, so the
shift amount of the last destination statement is a total of 12 bytes.

4. Source statements are deleted if the scalar variables are not live beyond the corresponding
destination statements.

Figure 7: Shifting
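
The register packing steps listed above can be checked with a small scalar emulation in C. The semantics given to `replicate` and `shift_and_load` here are one reading of the description in the text, not the definition of any real instruction: fields are numbered from 0 on the left, a left shift moves fields toward index 0, and the vacated trailing fields are filled from the second operand. The `superword` type and the `pack_full` and `pack_partial` helpers are hypothetical names.

```c
#include <assert.h>

#define FIELDS 4       /* 32-bit fields per 128-bit superword */
#define FIELD_BYTES 4

typedef struct { float f[FIELDS]; } superword;

/* replicate: copy field idx of src into every field of the result. */
superword replicate(superword src, int idx)
{
    assert(idx >= 0 && idx < FIELDS);
    superword t;
    for (int k = 0; k < FIELDS; k++)
        t.f[k] = src.f[idx];
    return t;
}

/* shift_and_load: shift dst left by nbytes, then fill the vacated
 * trailing fields from tmp. */
superword shift_and_load(superword dst, superword tmp, int nbytes)
{
    assert(nbytes >= 0 && nbytes % FIELD_BYTES == 0);
    assert(nbytes <= FIELDS * FIELD_BYTES);
    int n = nbytes / FIELD_BYTES;
    superword out;
    for (int k = 0; k < FIELDS; k++)
        out.f[k] = (k + n < FIELDS) ? dst.f[k + n] : tmp.f[k];
    return out;
}

/* Lines 1-8 of Figure 5(f): a completely filled destination. */
superword pack_full(superword vec0, superword vec1)
{
    superword temp1 = replicate(vec0, 3);
    superword temp2 = replicate(vec1, 0);
    superword temp3 = replicate(vec1, 1);
    superword temp4 = replicate(vec1, 2);
    superword vec2 = shift_and_load(temp1, temp1, 4);
    vec2 = shift_and_load(vec2, temp2, 4);
    vec2 = shift_and_load(vec2, temp3, 4);
    vec2 = shift_and_load(vec2, temp4, 4);
    return vec2;
}

/* Last four lines of Figure 5(f): only the first two fields are filled,
 * so the final shift is 12 bytes. */
superword pack_partial(superword vec6, superword vec7)
{
    superword temp1 = replicate(vec6, 2);
    superword temp2 = replicate(vec7, 2);
    superword vec8 = shift_and_load(temp1, temp1, 4);
    return shift_and_load(vec8, temp2, 12);
}
```

Under these assumed semantics, `pack_full` leaves vec0[3], vec1[0], vec1[1], vec1[2] in fields 0 through 3, and `pack_partial` leaves vec6[2] and vec7[2] in fields 0 and 1, matching what the scalar loads and stores of Figure 5(e) compute without going through memory.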

6.4 An Example: Shifting for Partial Reuse

Spatial reuse within a superword happens when distinct loop iterations access different data in the
same superword. Partial spatial reuse of superwords occurs when distinct loop iterations access
data in consecutive superwords in memory, partially reusing the data in one or both superwords,
as shown by the example in Figure 5(a), and illustrated graphically in Figure 7. In this example,
as before assuming that sws = 4, array reference in[i - j] has partial spatial reuse in loop i. For
a fixed value of i and j, the data accessed in iteration (i, j) consists of the last three words of the
superword accessed in iteration (i - 1, j), plus the first word of the next superword in memory. This
type of reuse can be exploited by shifting the first word out of the superword, and shifting in the
next word, as in Figure 7. As partially shown in Figure 5(c) and (f), only four superwords need to be
loaded for the data accessed in the 64 copies of in[i - j] in the loop body, after shifting is applied.
Before shifting, in[i - j] had to be loaded from memory (and possibly aligned) for each of the four
copies of in[i - j] in the loop body.

This shifting opportunity arises frequently in both signal and image processing applications,
where one object is compared to a subcomponent of another object, such as the example in
Figure 5(a). We detect these opportunities through the analysis described in Section 5. The optimization
shown in Figure 7 falls out from the combination of unroll-and-jam, alignment operations generated
by the SLP compiler, superword replacement and register packing.


7. Experimental Results

This section presents an experiment that demonstrates the dramatic performance improvements that
can be derived from compiler-controlled caching in superword registers. We describe an
implementation that incorporates superword register locality optimizations into an existing compiler
exploiting superword-level parallelism [1]. We present a set of results on four multimedia kernels and two
scientific applications, derived automatically from our implementation.

7.1 Implementation and Methodology

Figure 8 illustrates the system we have developed for this experiment, which uses the Stanford
SUIF compiler as its underlying infrastructure [26]. The input to the system is a C program,
which is then optimized by passes in SUIF, including our Superword Locality analysis described
in Section 3, followed by the Superword-Level Parallelism (SLP) optimization passes by Larsen
and Amarasinghe [1], and finally, an optimization pass that performs superword replacement as
described in Section 6.2 to steer the compiler to obtain the reuse in superword registers that the SLL
algorithm determined was possible.

This ordering of passes was selected primarily for implementation convenience, since we were
building on the existing SLP compiler implementation. The SLP passes operate on the code at a
low level, where it is difficult to reconstruct the loop structure and array access expressions. Thus,
register requirement analysis and unroll-and-jam were applied prior to SLP, rather than afterward,
as was suggested by the examples in Section 2. Superword replacement must follow SLP, which is
the reason the components of our algorithm are performed on either side of SLP. Note that both the
SLP passes and SLL employ loop unrolling, but for different reasons. The SLP compiler operates
on basic blocks and unrolls the innermost loop of a loop nest to convert loop-level parallelism into
basic-block parallelism. The SLL algorithm performs unroll-and-jam to expose locality in basic
blocks. However, the loop that carries the most spatial locality at the superword level is often the
loop that carries the most superword-level parallelism. Therefore, it is a reasonable choice to use
the SLL algorithm to expose both parallelism and locality in the loop body while suppressing the
unrolling originally performed by the SLP compiler.


Figure 8: Implementation.

The output from the SUIF portion of the system is an optimized C program, augmented with
special superword data types and operations. Currently, the resulting code is passed to a GNU C backend,
modified to support superword data types and operations for the PowerPC AltiVec instruction-set
architecture extensions. Each superword operation corresponds, in most cases, to a single
instruction in the AltiVec ISA. The role of the GCC backend includes replacing the vector operations with
the corresponding AltiVec superword instruction, and allocating the vector data types to the
superword registers. The resulting code is executed on a 533 MHz Macintosh PowerPC G4, which has a
superword register file consisting of 32 128-bit registers.

7.2 Performance Measurements

We have applied the previously-described implementation to four of the five multimedia kernels and
the two scientific programs from the Specfp95 benchmark suite for which execution time speedups
were reported in Larsen and Amarasinghe, summarized in Table 2 [1]. As a first step, we verified
that we could reproduce their previously reported results. For purposes of comparison, we initially
followed the same methodology established in Larsen and Amarasinghe [1]: (1) we used the same
programs; (2) all versions of the code were compiled on the AltiVec without optimization; and, (3)
baseline measurements were derived by compiling the unparallelized code for the PowerPC G4. We
are using an updated implementation of SLP from what was published, as well as a faster target
machine and new releases of GCC and the Linux operating system, so there are some differences in
results, but they are very minor.

Name      Description                       Data Width       Input Size
VMM       Vector-matrix multiply            32-bit float     512 elements
FIR       Finite impulse response filter    32-bit float     256 filter, 1M signal
YUV       RGB to YUV conversion             16-bit integer   32K elements
MMM       Matrix-matrix multiply            32-bit float     512 elements
swim      Shallow water model               32-bit float     Specfp95 reference input
tomcatv   Mesh generation                   32-bit float     Specfp95 reference input

Table 2: Benchmark programs.

Larsen and Amarasinghe were unable to use optimization on the AltiVec-extended GCC
backend at the time of their study, but in the intervening time, this Motorola-supplied backend has
become more robust. For the results presented in this section, we modify the methodology to perform
"-O3" optimizations. To understand the overall benefits of exploiting compiler-controlled caching
in superword registers, we have compared the results of the full system with those obtained when
SLP is used alone. For this reason, we report results where SLP is applied to the original codes and
compare these results to the full system.

We show three sets of results. First, in Table 3, we show the number of vector, scalar and total
memory accesses for the baseline and the full system. Our approach eliminates from 38% to 69%
of the vector loads and stores in the four kernels, and over 85% in SWIM and TOMCATV. We also
eliminate over 90% of the scalar loads and stores in the four kernels, and over 35% in SWIM and
TOMCATV using register packing, as described in Section 6.3. When combined, more than 50%
of memory accesses are eliminated.
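
As a worked instance of how these percentages relate to the access counts, take the MMM totals listed in Table 3: 303,039,489 accesses with SLP alone and 50,922,499 with the full system. Assuming the reduction column is the fraction of baseline accesses eliminated, the arithmetic is:

```latex
\[
\text{reduction}
  = \left(1 - \frac{50{,}922{,}499}{303{,}039{,}489}\right) \times 100\%
  \approx 83.20\%
\]
```

This agrees with the 83.20 reported for that row.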

Figure 9 shows how these reductions in instructions translate into speedups over SLP. To isolate
the benefits of individual components of our system, we measure the performance of the code at
several stages of the optimization process. The first bar, normalized to 1, shows the results of
SLP alone. The second bar, called Unrolled+SLP, shows the results of running the first portion
of the SLL algorithm, described in Section 3, which performs unroll-and-jam on the loop nest to
expose opportunities for superword reuse, and following up with SLP. This bar isolates the impact
of unrolling, since it is not until after SLP that this reuse is actually exploited. Also, because it is
reordering the iteration space to bring reuse closer together, this version will also obtain locality
benefits in the data cache. Thus, this bar provides the cache locality benefits of unroll-and-jam,
which can be compared against the additional improvements from superword register locality. The
third bar, Superword Replacement, provides speedup using superword replacement, as described in
Section 6.2. The final bar, entitled Register Packing, shows the additional improvement due to this
technique, described in Section 6.3.

Name      Mem. Acc.   SLP only (baseline)   SLP+SLL+RegPack   Reduction (%)
VMM       Scalar          301,989,888                    0        100.00
          Vector          100,663,297           50,462,723        [missing]
          Total           402,653,185           50,462,723        [missing]
FIR       Scalar        1,113,940,672           82,031,104        [missing]
          Vector          196,558,849          120,631,297        [missing]
          Total         1,310,499,521          202,662,401        [missing]
YUV       Scalar                9,400                    0        [missing]
          Vector           52,428,801           23,756,801        [missing]
          Total            52,438,201           23,756,801        [missing]
MMM       Scalar          135,267,328              525,312         99.61
          Vector          167,772,161           50,397,187         69.96
          Total           303,039,489           50,922,499         83.20
swim      Scalar       17,150,342,657        8,920,336,007         47.99
          Vector        8,495,723,139        1,200,754,698         85.87
          Total        25,646,065,796       10,121,090,705         60.54
tomcatv   Scalar          599,038,032          384,070,586         35.89
          Vector          284,631,621            9,915,592         96.51
          Total           883,669,653          393,986,178         55.41

Table 3: The number of dynamic memory accesses.


Overall, we see that in combination, applications achieve speedups between 1.3 and 3.1 over
SLP alone, with an average of 2.2X. Consideration of TOMCATV and SWIM shows that both
programs have little temporal reuse, although there is a small amount of spatial reuse that is exploited
with our approach, particularly in TOMCATV. We are obtaining a locality benefit due to
unroll-and-jam. We also observe additional SLP due to index set splitting, motivated by the need to create a
steady-state loop where the data is aligned to a superword boundary. The four other programs show
a significant improvement from superword replacement. For VMM, MMM and FIR, there are also
huge gains due to register packing.
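
The index set splitting mentioned above can be sketched as follows. This is a hedged illustration
rather than the compiler's actual transformation: it assumes the array base is aligned, so that an
element index is on a superword boundary exactly when it is a multiple of the superword width, and
the function names are hypothetical.

```python
def split_for_alignment(lo, hi, width):
    """Split the iteration range [lo, hi) into (prologue, steady_state, epilogue).

    The steady-state range starts on a multiple of `width` and its length is a
    multiple of `width`, so every group of `width` iterations is aligned.
    """
    start = min(hi, -(-lo // width) * width)  # first multiple of width >= lo
    end = hi - (hi - start) % width           # drop the partial trailing group
    return (lo, start), (start, end), (end, hi)


def sum_with_aligned_steady_state(data, lo, hi, width=4):
    """Sum data[lo:hi], handling the aligned middle part one superword at a time."""
    prologue, steady, epilogue = split_for_alignment(lo, hi, width)
    total = 0
    for i in range(*prologue):                # scalar iterations before alignment
        total += data[i]
    for i in range(steady[0], steady[1], width):
        total += sum(data[i:i + width])       # stands in for one aligned superword operation
    for i in range(*epilogue):                # scalar iterations after the last full group
        total += data[i]
    return total
```

With a width of 4, the range [3, 18) splits into a one-iteration prologue [3, 4), a steady state
[4, 16) of three aligned groups, and a two-iteration epilogue [16, 18).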

In Figure 10, we further explore the relationship between superword replacement and register
packing. The first bar, which is normalized to 1, shows the Unrolled+SLP version (the second bar
in the previous figure). The second bar is the Unrolled+SLP+SWR result from the previous figure,
but this time it is normalized to Unrolled+SLP. To show the isolated benefit of register packing
without superword replacement, we applied register packing to the Unroll+SLP version, obtaining
the results shown in the third bar (Unroll+SLP+RP) of Figure 10. The final bar is the result of
applying all of the optimizations. As might be expected from the previous figure, register packing,
either in isolation or in conjunction with superword replacement, does not impact the results for
YUV, swim or tomcatv. We see that for VMM and MMM, register packing yields about the same
improvement when applied prior to superword replacement as when applied afterward. Especially
interesting are the results for FIR, because the speedup is much larger when superword replacement
and register packing are applied together than when they are applied separately. On further
investigation, we found that the Unroll+SLP+RP version suffered from register spilling. Superword
replacement removes the majority of the superword variables used in the Unroll+SLP+RP version,
which in turn reduces register pressure. This result is consistent with the goal of the algorithm in
Section 3. We selected unroll factors based on the assumption that superword replacement would be
performed. Without superword replacement, there is register pressure after unrolling, and this is
magnified by register packing because it introduces additional superword variables.

Figure 9: Speedups over SLP alone.
Figure 10: Impact of register packing.

In summary, the SLL techniques presented in this paper dramatically reduce the number of
memory accesses and yield significant performance improvements across these 6 programs. Thus,
this paper has demonstrated the value of exploiting locality in superword registers in architectures
that support superword-level parallelism such as the AltiVec.


8.  Related  Research 

For well over a decade, a significant body of research has been devoted to code transformations
to improve cache locality, most of it targeting loop nests with regular data access patterns [27, 28,
29, 20]. Loop optimizations for improving data locality, such as tiling, interchanging and skewing,
focus on reducing cache capacity misses. Of particular relevance to this paper are approaches to
tiling for cache to exploit temporal and spatial reuse; the bulk of this work examines how to select
tile sizes that eliminate both capacity misses and conflict misses, tuned to the problem and cache
sizes [31, 11, 12, 13, 14, 15, 16, 17, 18, 42]. The key difference between our work and that of tiling
for caches is that interference is not an issue in registers. Therefore, models that consider conflict
misses are not appropriate. Further, our code generation strategy must explicitly manage reuse in
registers.

There has been much less attention paid to tiling and other code transformations to exploit reuse
in registers, where conflict misses do not occur, but registers must be explicitly named and managed.
A few approaches examine mapping array variables to scalar registers [18, 54, 20]. Most closely
related to ours is the work by Carr and Kennedy, which uses scalar replacement and
unroll-and-jam to exploit scalar register reuse [19]. Like our approach, in deriving the unroll factors, they
use a model to count the number of registers required for a potential unrolling to avoid register
pressure, and they replace array accesses, which would result in memory accesses, with accesses to
temporaries that will be put in registers by the backend compiler. Their search for an unroll factor
is constrained by register pressure and another metric called balance that matches memory access
time to floating point computation time. Our approach is distinguished from all these others in that
the model for register requirements must take spatial locality into account, we replace array accesses
with superwords rather than scalars, and we also consider the optimizations in light of superword
parallelism.
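
A search for an unroll factor under a register constraint can be sketched as below. This is a
deliberately simple toy model, not the model used in this paper or by Carr and Kennedy: the
register count, the cost function and all names are assumptions chosen to show the shape of the
search for a kernel in which one operand is shared by every unrolled copy of the loop body.

```python
def registers_needed(u):
    """Toy model: u accumulators, one shared operand, one temporary for the other operand."""
    return u + 2


def loads_per_iteration(u):
    """Toy cost: one private load per original iteration plus a 1/u share of the reused load."""
    return 1 + 1 / u


def choose_unroll_factor(register_budget, max_unroll=16):
    """Pick the unroll factor with the fewest loads per iteration that fits the budget."""
    best = 1
    for u in range(1, max_unroll + 1):
        if registers_needed(u) > register_budget:
            break  # larger factors need even more registers and would spill
        if loads_per_iteration(u) < loads_per_iteration(best):
            best = u
    return best
```

With a budget of 8 registers this toy model selects an unroll factor of 6. A model for superword
registers would additionally have to account for spatial locality, as noted above.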

There are several recent compilation systems developed for superword-level parallelism [1, 7,
8, 9, 10]. Most, including also commercial compilers [34, 35], are based on vectorization
technology [7, 9]. In contrast, Larsen and Amarasinghe devised a superword-level parallelization system
for multimedia extensions [1]. They point out that there are many differences between the
multimedia extension architectures and vector architectures, such as short vectors, ease of mixing with
scalar instructions, and need for alignment of memory accesses [46]. They argue that their
algorithm for finding superword-level parallelism from a basic block instead of a loop nest is much more
effective than using vectorization-based techniques. None of the above approaches exploit reuse in
the superword register file.

9.  Conclusion 

This paper presents an algorithm for compiler-controlled caching in superword register files. The
algorithm is applicable to multimedia extensions such as Intel's SSE, PowerPC's AltiVec, and also
to Processor-in-memory (PIM) architectures with support for superword operations.

We implemented our approach in an existing compiler targeting superword-level parallelism.
We presented experimental results, derived automatically, comparing the performance of six
benchmarks and multimedia kernels optimized for parallelism only, using SLP, and optimized for both
parallelism and locality. Our results show speedups ranging from 1.3 to 3.1X, and an average of 2.2X,
on the 6 programs as compared to using SLP alone, and most memory accesses are removed.

The approach taken here that separates optimizations for SLL and SLP is convenient for
implementation purposes, since we are building upon the work of others. Further, as there are now a few
other compilers that exploit superword-level parallelism [7, 8, 9, 10], the same approach can be used
to extend these existing systems to incorporate compiler-controlled caching in superword registers.
Ideally, however, an optimizer that integrates the superword parallelism and locality techniques could
be even more effective. For example, in a combined algorithm, selection of which loops to parallelize
could also take superword-level locality into account. A combined algorithm is the subject of future
work.